Abstract

We show that one can approximate the least fixed point solution for a multivariate system 
of monotone probabilistic polynomial equations in time polynomial in both the encoding size of 
the system of equations and in $\log(1/\epsilon)$, where $\epsilon > 0$ is the desired additive error bound of the
solution. (The model of computation is the standard Turing machine model.) 

We use this result to resolve several open problems regarding the computational complexity of 
computing key quantities associated with some classic and heavily studied stochastic processes, 
including multi-type branching processes and stochastic context-free grammars. 

1 Introduction 

Some of the central computational problems associated with a number of classic stochastic processes 
can be rephrased as a problem of computing the non-negative least fixed point solution of an 
associated multivariate system of monotone polynomial equations. 

In particular, this is the case for computing the extinction probabilities (also called final proba- 
bilities) for multi-type branching processes (BPs), a problem which was first studied in the 1940s by 
Kolmogorov and Sevastyanov [23]. Branching processes are a basic stochastic model in probability
theory, with applications in diverse areas ranging from population biology to the physics of nuclear 
chain reactions (see [18] for the classic theoretical text on BPs, and [22, 17, 28] for some of the more
recent applied textbooks on BPs). BPs describe the stochastic evolution of a population of objects 
of distinct types. In each generation, every object a of each type T gives rise to a (possibly empty) 
multiset of objects of distinct types in the next generation, according to a given probability distribution
on such multisets associated with the type T. The extinction probability, $q_T$, associated
with type T is the probability that, starting with exactly one object of type T, the population will 
eventually become extinct. Computing these probabilities is fundamental to many other kinds of 
analyses for BPs (see, e.g., [18]). Such probabilities are in general irrational, even when the finite
data describing the BP (namely, the probability distributions associated with each of the finitely 
many types T) are given by rational values (as is assumed usually for computational purposes). 
Thus, we would like to compute the probabilities approximately to desired precision. 

Another essentially equivalent problem is that of computing the probability of the language 
generated by a stochastic context-free grammar (SCFG), and more generally its termination prob- 
abilities (also called the partition function). SCFGs are a fundamental model in statistical natural 
language processing and in biological sequence analysis (see, e.g., [251 El [26] ) . A SCFG provides a 
probabilistic model for the generation of strings in a language, by associating probabilities to the
rules of a CFG. The termination probability of a nonterminal A is the probability that a random
derivation of the SCFG starting from A eventually terminates and generates a finite string; the total
probability of the language of a SCFG is simply the termination probability for the start symbol
of the SCFG. Computing these termination probabilities is again a key computational problem for 
the analysis of SCFGs, and is required for computing other quantities, for example the probability 
of generating a given string. 

Despite decades of applied work on BPs and SCFGs, as well as theoretical work on their compu- 
tational problems, no polynomial time algorithm was known for computing extinction probabilities 
for BPs, nor for termination probabilities for SCFGs, nor even for approximating them within any 
nontrivial constant: prior to this work it was not even known whether one can distinguish in P-time 
the case where the probability is close to 0 from the case where it is close to 1.

We now describe the kinds of nonlinear equations that have to be solved in order to compute the 
above mentioned probabilities. Consider systems of multi-variate polynomial fixed point equations 
in n variables, with n equations, of the form $x_i = P_i(x)$, $i = 1, \ldots, n$, where $x = (x_1, \ldots, x_n)$
denotes the vector of variables, and $P_i(x)$ is a multivariate polynomial in the variables x. We
denote the entire system of equations by x = P(x). The system is monotone if all the coefficients 
of the polynomials are nonnegative. It is a probabilistic polynomial system (PPS) if in addition the 
coefficients of each polynomial sum to at most 1. 

It is easy to see that a system of probabilistic polynomials $P(x)$ always maps any vector in
$[0,1]^n$ to another vector in $[0,1]^n$. It thus follows, by Brouwer's fixed point theorem, that a PPS
$x = P(x)$ always has a solution in $[0,1]^n$. In fact, it always has a unique least solution, $q^* \in [0,1]^n$,
which is coordinate-wise smaller than any other non-negative solution, and which is the least fixed
point (LFP) of the monotone operator $P : [0,1]^n \to [0,1]^n$ on $[0,1]^n$. The existence of the LFP,
$q^*$, is guaranteed by Tarski's fixed point theorem. From a BP or a SCFG we can construct easily
a probabilistic polynomial system x = P(x) whose LFP q* yields precisely the vector of extinction 
probabilities for the BP, or termination probabilities for the SCFG. Indeed, the converse also 
holds: computing the extinction probabilities for a BP (termination probabilities of an SCFG) and 
computing the LFP of a system of probabilistic polynomial equations are equivalent problems. As 
we discuss below, some other stochastic models also lead to equivalent problems or to special cases. 

Previous Work. As already stated, the polynomial-time computability of these basic probabilities
for multi-type branching processes and SCFGs has been a longstanding open problem. In
[13], we studied these problems as special sub-cases of a more general class of stochastic processes
called recursive Markov chains (RMCs), which form a natural model for analysis of probabilistic 
procedural programs with recursion, and we showed that these problems are equivalent to com- 
puting termination probabilities for the special subclass of 1-exit RMCs (1-RMC). General RMCs 
are expressively equivalent to the model of probabilistic pushdown systems studied in [6]. We 
showed that for BPs, SCFGs, and 1-RMCs, the qualitative problem of determining which proba- 
bilities are exactly 1 (or 0) can be solved in P-time, by exploiting basic results from the theory 
of branching processes. We proved however that the decision problem of determining whether the 
extinction probability of a BP (or termination probability of a SCFG or a 1-RMC) is > 1/2 is at 
least as hard as some longstanding open problems in the complexity of numerical computation, 
namely, the square-root sum problem, and a much more general decision problem (called PosSLP) 
which captures the power of unit-cost exact rational arithmetic [2], and hence it is very unlikely 
that the decision problem can be solved in P-time. For general RMCs we showed that in fact 
this hardness holds for computing any nontrivial approximation of the termination probabilities. 
No such lower bound was shown for the approximation problem for the subclass of 1-RMCs (and 
BPs and SCFGs). In terms of upper bounds, the best we knew so far, even for any nontrivial
approximation, was that the problem is in FIXP (which is in PSPACE), i.e., it can be reduced to
approximating a Nash equilibrium of a 3-player game [H]. We improve drastically on this in this 
paper, resolving the problem completely, by showing we can compute these probabilities in P-time 
to any desired accuracy. 

An equivalent way to formulate the problem of computing the LFP, $q^*$, of a PPS, $x = P(x)$, is
as a mathematical optimization problem: minimize $\sum_{i=1}^{n} x_i$, subject to $\{P(x) - x \le 0;\ x \ge 0\}$.
This program has a unique optimal solution, which is the LFP $q^*$. If the constraints were convex,
the solution could be computed approximately using convex optimization methods. In general, the
PPS constraints are not convex (e.g., $x_2 x_3 - x_1 \le 0$ is not a convex constraint); however, for certain
restricted subclasses of PPSs they are. This is so for backbutton processes, which were introduced
and studied by Fagin et al. in [16] and used there to analyze random walks on the web. Backbutton
processes constitute a restricted subclass of SCFGs (see [13]). Fagin et al. applied semidefinite
programming to approximate the corresponding termination probabilities for backbutton processes, 
and used this as a basis for approximating other important quantities associated with them. 

There are a number of natural iterative methods that one can try to use (and which indeed are 
used in practice) in order to solve the equations arising from BPs and SCFGs. The simplest such 
method is value iteration: starting with the vector $x^{0} = 0$, iteratively compute the sequence
$x^{i+1} := P(x^{i})$, $i = 0, 1, 2, \ldots$. The sequence always converges monotonically to the LFP $q^*$.
Unfortunately, it can be very slow to converge: even for the simple univariate polynomial system
$x = (1/2)x^2 + 1/2$, for which $q^* = 1$, one requires $2^{i-3}$ iterations to exceed $1 - 1/2^{i}$,
i.e., to get $i$ bits of precision.
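
The slow convergence on this univariate example is easy to observe numerically. The following is a minimal sketch, not taken from the paper: the function name `value_iteration_steps` and the use of floating point are illustrative assumptions. It counts how many value-iteration steps, started at 0, are needed before $1 - x$ drops below $2^{-\mathrm{bits}}$.

```python
def value_iteration_steps(bits):
    """Count iterations of x <- (1/2) x^2 + 1/2, started at x = 0,
    until the distance to the LFP q* = 1 is below 2**-bits.
    (Hypothetical helper; floating point is adequate for small `bits`.)"""
    x = 0.0
    steps = 0
    target = 2.0 ** -bits
    while 1.0 - x >= target:
        x = 0.5 * x * x + 0.5   # one step of value iteration
        steps += 1
    return steps

# Each extra bit of precision roughly doubles the number of iterations.
for bits in (4, 8, 12):
    print(bits, value_iteration_steps(bits))
```

The printed counts grow roughly in proportion to $2^{\mathrm{bits}}$, which is the exponential behaviour described above.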

In [13] we provided a much better method that always converges monotonically to q*. Namely, 
we showed that a decomposed variant of Newton's method can be applied to such systems of 
equations (and in fact, much more generally, to any monotone system of polynomial equations) x = 
P(x), and always converges monotonically to the LFP solution q* (if a solution exists). Optimized 
variants of this decomposed Newton's method have by now been implemented in several tools 
[30, 26], and they perform quite well in practice on many instances.

The theoretical speed of convergence of Newton's method on such monotone (probabilistic) 
polynomial systems was subsequently studied in much greater detail by Esparza, Kiefer, and Luttenberger
in [10]. They showed that, even for Newton's method on PPSs, there are instances where
exponentially many iterations of Newton's method (even with exact arithmetic in each iteration) 
are required, as a function of the encoding size of the system, in order to converge to within just 
one bit of precision of the solution q*. On the upper bound side, they showed that after some 
number of iterations in an initial phase, thereafter Newton obtains an additional bit of precision 
per iteration (this is called linear convergence in the numerical analysis literature). In the special 
case where the input system of equations is strongly connected, meaning roughly that all variables 
depend (directly or indirectly) on each other in the system of equations x = P(x), they proved an 
exponential upper bound on the number of iterations required in the initial phase as a function of 
input size. For the general case where the input system of equations is not strongly connected, they 
did not provide any upper bound as a function of the input size. In more recent work, Esparza et
al. [9] further studied probabilistic polynomial systems. They did not provide any new worst-case
upper bounds on the behavior of Newton's method in this case, but they studied a modified method 
which is in practice more robust numerically, and they also showed that the qualitative problem of 
determining whether the LFP q* = 1 is decidable in strongly polynomial time. 

Our Results. In this paper we provide the first polynomial time algorithm for computing, to 
any desired accuracy, the least fixed point solution, q*, of probabilistic polynomial systems, and 
thus also provide the first P-time approximation algorithm for extinction probabilities of BPs, and
termination probabilities of SCFGs and 1-exit RMCs. The algorithm proceeds roughly as follows:

1. We begin with a preprocessing step, in which we determine all variables which have value
0 or 1 in the LFP $q^*$ and remove them from the system.

2. On the remaining system of equations, $x = P(x)$, with an LFP $q^*$ such that $0 < q^* < 1$,
we apply Newton's method, starting at initial vector $x^{(0)} := 0$. Our key result is to show that,
once variables $x_i$ with $q^*_i \in \{0, 1\}$ have been removed, Newton's method only requires polynomially
many iterations (in fact, only linearly many iterations) as a function of both the encoding size of
the equation system and of $\log(1/\epsilon)$ to converge to within additive error $\epsilon > 0$ of the vector $q^*$. To
do this, we build on the previous works |13| , I10 | H3], and extend them with new techniques. 

3. The result in the previous step applies to the unit-cost arithmetic RAM model of computa- 
tion, where we assume that each iteration of Newton's method is carried out in exact arithmetic. 
The problem with this, of course, is that in general after only a linear number of iterations, the 
number of bits required to represent the rational numbers in Newton's method can be exponential 
in the input's encoding size. We resolve this by showing, via a careful round-off analysis, that if 
after each iteration of Newton's method the positive rational numbers in question are all rounded 
down to a suitably long but polynomial encoding length (as a function of both the input size and 
of the desired error $\epsilon > 0$), then the resulting "approximate" Newton iterations will still be well-defined
and will still converge to $q^*$, within the desired error $\epsilon > 0$, in polynomially (in fact linearly)
many iterations. The correctness of the rounding relies crucially on the properties of PPSs shown
in step 2, and it does not work in general for other types of equation systems.¹
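
To make steps 2 and 3 concrete, the following minimal sketch runs Newton's method with rounding down after each iteration on a single illustrative equation, $x = (3/5)x^2 + 2/5$, whose LFP is $q^* = 2/3$ (so the preprocessing of step 1 removes nothing). The equation, the helper names `round_down` and `rounded_newton`, and the values of `h` and `iterations` are assumptions chosen for illustration; they are not the bounds established in the paper.

```python
from fractions import Fraction

def round_down(x, h):
    """Round a nonnegative rational down to a multiple of 2**-h."""
    scale = 1 << h
    return Fraction((x.numerator * scale) // x.denominator, scale)

def rounded_newton(h, iterations):
    """Rounded-down Newton iteration for x = (3/5) x^2 + 2/5, started at 0.
    (Illustrative univariate instance; hypothetical helper.)"""
    a, c = Fraction(3, 5), Fraction(2, 5)
    x = Fraction(0)
    for _ in range(iterations):
        p = a * x * x + c            # P(x)
        dp = 2 * a * x               # P'(x), the 1x1 Jacobian B(x)
        x = x + (p - x) / (1 - dp)   # exact Newton step
        x = round_down(x, h)         # keep only h bits after the binary point
    return x

approx = rounded_newton(h=40, iterations=12)
print(float(approx))
```

Because each iterate is rounded down, the sequence stays below $q^*$, and the size of every iterate is bounded by the chosen number of bits `h` rather than growing with the iteration count.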

Extinction probabilities of BPs and termination probabilities of SCFGs are basic quantities 
that are important in many other analyses of these processes. We illustrate an application of these 
results by solving in polynomial time some other important problems for SCFGs that are at least 
as hard as the termination problem. We show two results in this regard: 

(1) Given a SCFG and a string, we can compute the probability of the string to any desired 
accuracy in polynomial time. This algorithm uses the following construction: 

(2) Given an SCFG, we can compute in P-time another SCFG in Chomsky normal form (CNF) 
that is approximately equivalent in a well-defined sense. 

These are the first P-time algorithms for these problems that work for arbitrary SCFGs, including
grammars that contain $\epsilon$-rules. Many tasks for SCFGs, including computation of string
probabilities, become much easier for SCFGs in CNF, and in fact, many papers start by assuming 
that the given SCFG is in CNF. In the nonstochastic case, there are standard efficient algorithms 
for transforming any CFG to an equivalent one in CNF. However, for stochastic grammars this is 
not the case and things are much more complicated. It is known that every SCFG has an equivalent 
one in CNF [Tj, however, as remarked in PQ, their proof is nonconstructive and does not yield any 
algorithm (not even an exponential-time algorithm). Furthermore, it is possible that even though 
the given SCFG has rational rule probabilities, the probabilities for any equivalent SCFG in CNF 
must be irrational, hence they have to be computed approximately. We provide here an efficient 
P-time algorithm for computing such a CNF SCFG. To do this requires our P-time algorithm for 
SCFG termination probabilities, and requires the development of considerable additional machinery 
to handle the elimination of probabilistic $\epsilon$-rules and unary rules, while keeping numerical values
polynomially bounded in size, yet ensuring that the final SCFG meets the desired accuracy bounds. 

¹ In particular, there are examples of PPSs which do have $q^*_i = 1$ for some $i$, such that this rounding method
fails completely because of very severe ill- conditioning (see PU). Also, for quasi-birth- death (QBDs) processes, a 
stochastic model studied heavily in queueing systems, different monotone polynomial equations can be associated with
key probabilities, and Newton's method converges in polynomially many iterations [12], but this rounding fails, and
in fact it is an open problem whether the key probabilities for QBDs can be computed in P-time in the Turing model.

The paper is organized as follows: Section 2 gives basic definitions and background; Section 3
addresses the solution of PPSs, showing a linear bound on the number of Newton iterations; Section 
4 shows a polynomial time bound in the Turing model; Section 5 describes briefly the applications 
to SCFGs. Due to space, most proofs and technical development are in the Appendix. 

2 Definitions and Background 

A (finite) multi-type Branching Process (BP), $G = (V, R)$, consists of a (finite) set $V =
\{S_1, \ldots, S_n\}$ of types, and a (finite) set $R = \bigcup_{i=1}^{n} R_i$ of rules, which are partitioned into distinct rule
sets, $R_i$, associated with each type $S_i$. Each rule $r \in R_i$ has the form $S_i \xrightarrow{p_r} \alpha_r$, where $p_r \in (0, 1]$,
and $\alpha_r$ is a finite multiset (possibly the empty multiset) whose elements are in $V$. Furthermore,
for every type $S_i$, we have $\sum_{r \in R_i} p_r = 1$. The rule $S_i \xrightarrow{p_r} \alpha_r$ specifies the probability with which an
entity (or object) of type $S_i$ generates the multiset $\alpha_r$ of offspring in the next generation. As usual,
rule probabilities $p_r$ are assumed to be rational for computational purposes. Multisets $\alpha_r$ over $V$ can
be encoded by giving a vector $v(\alpha_r) \in \mathbb{N}^n$, with the $i$'th coordinate $v(\alpha_r)_i$ representing the number
of elements of type $S_i$ in the multiset $\alpha_r$. We assume instead that the multisets $\alpha_r$ are represented
even more succinctly in sparse representation, by specifying only the non-zero coordinates of the
vector $v(\alpha_r)$, encoded in binary.

A BP, $G = (V, R)$, defines a discrete-time stochastic (Markov) process, whose states are multisets
over $V$, or equivalently elements of $\mathbb{N}^n$. If the state at time $t$ is $\alpha^{t}$, then the next state $\alpha^{t+1}$ at time
$t + 1$ is determined by independently choosing, for each object of each type $S_i$ in the multiset $\alpha^{t}$, a
random rule $r \in R_i$ of the form $S_i \xrightarrow{p_r} \alpha_r$, according to the probability $p_r$ of that rule, yielding the
multiset $\alpha_r$ as the "offspring" of that object in one generation. The multiset $\alpha^{t+1}$ is then given
by the multiset union of all such offspring multisets, randomly and independently chosen for each
object in the multiset $\alpha^{t}$. A trajectory (sample path) of this stochastic process, starting at time 0
in initial multiset $\alpha^{0}$, is a sequence $\alpha^{0}, \alpha^{1}, \alpha^{2}, \ldots$ of multisets over $V$. Note that if ever the process
reaches extinction, i.e., if ever $\alpha^{t} = \{\}$ at some time $t \ge 0$, then $\alpha^{t'} = \{\}$ for all times $t' \ge t$.
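
The one-generation transition just described can be sketched as follows. This is an illustration only: the data layout (a list of rule sets, each rule a pair of a probability and a sparse offspring dictionary), the function name `next_generation`, and the example BP are assumptions, and floating-point weights are used for sampling.

```python
import random
from collections import Counter

def next_generation(bp, state, rng):
    """Sample the next state: every object independently picks one rule of
    its type, and the offspring multisets are combined by multiset union.
    `bp[i]` is the rule set R_i as (probability, {type: count}) pairs."""
    new_state = Counter()
    for type_index, how_many in state.items():
        rules = bp[type_index]
        weights = [float(p) for p, _ in rules]
        for _ in range(how_many):
            _, offspring = rng.choices(rules, weights=weights, k=1)[0]
            new_state.update(offspring)
    return +new_state  # unary plus drops types whose count is zero

# Hypothetical two-type BP used only for this illustration.
bp = [
    [(0.5, {1: 1}), (0.5, {})],   # S_0 -> {S_1} or {} with probability 1/2 each
    [(0.6, {1: 2}), (0.4, {})],   # S_1 -> {S_1, S_1} w.p. 3/5, {} w.p. 2/5
]
rng = random.Random(0)
state = Counter({0: 1})           # start with a single object of type S_0
for t in range(5):
    state = next_generation(bp, state, rng)
```

Once `state` is the empty multiset it remains empty in all later generations, matching the remark on extinction above.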

Very fundamental quantities associated with a BP, which are a key to many analyses of BPs,
are its vector of extinction probabilities, $q^* \in [0,1]^n$, where $q^*_i$ is defined as the probability that,
starting with initial multiset $\alpha^{0} := \{S_i\}$ at time 0, i.e., starting with a single object of type $S_i$, the
stochastic process eventually reaches extinction, i.e., that $\alpha^{t} = \{\}$ at some time $t \ge 0$.

Given a BP, $G = (V, R)$, there is a system of polynomial equations in $n = |V|$ variables,
$x = P(x)$, that we can associate with $G$, such that the least non-negative solution vector for
$x = P(x)$ is the vector of extinction probabilities $q^*$ (see, e.g., [18, 13]). Let us define these
equations. For an n-vector of variables $x = (x_1, \ldots, x_n)$, and a vector $v \in \mathbb{N}^n$, we use the shorthand
$x^v$ to denote the monomial $x_1^{v_1} \cdots x_n^{v_n}$. Given BP $G = (V, R)$, we define equation $x_i = P_i(x)$
by: $x_i = \sum_{r \in R_i} p_r x^{v(\alpha_r)}$. This yields n polynomial equations in n variables, which we denote by
$x = P(x)$. It is not hard to establish that $q^* = P(q^*)$. In fact, $q^*$ is the least non-negative solution
of $x = P(x)$. In other words, if $q' = P(q')$ for $q' \in \mathbb{R}^n_{\ge 0}$, then $q' \ge q^* \ge 0$, i.e., $q'_i \ge q^*_i$ for all
$i = 1, \ldots, n$.
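
The construction of $x = P(x)$ from a BP can be sketched directly from the definition $P_i(x) = \sum_{r \in R_i} p_r x^{v(\alpha_r)}$. The two-type BP below, the data layout, and the name `apply_pps` are illustrative assumptions, not taken from the paper; for this example one can check by hand that the least fixed point is $(5/6, 2/3)$, while $(1, 1)$ is another fixed point.

```python
from fractions import Fraction

F = Fraction

# bp[i] is the rule set R_i: pairs (p_r, offspring), where `offspring`
# maps a type index to its multiplicity (sparse form of v(alpha_r)).
example_bp = [
    [(F(1, 2), {1: 1}), (F(1, 2), {})],   # S_0 -> {S_1} w.p. 1/2, {} w.p. 1/2
    [(F(3, 5), {1: 2}), (F(2, 5), {})],   # S_1 -> {S_1, S_1} w.p. 3/5, {} w.p. 2/5
]

def apply_pps(bp, x):
    """Evaluate P(x), where P_i(x) = sum over r in R_i of p_r * x^{v(alpha_r)}."""
    result = []
    for rules in bp:
        total = Fraction(0)
        for p, offspring in rules:
            term = p
            for j, count in offspring.items():
                term *= x[j] ** count
            total += term
        result.append(total)
    return result

q = [F(5, 6), F(2, 3)]
print(apply_pps(example_bp, q) == q)   # checks that q is a fixed point of P
```

Exact rational arithmetic is used here so that the fixed-point check is not affected by rounding.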

Note that this system of polynomial equations $x = P(x)$ has very special properties. Namely,
(I): the coefficients and constant of each polynomial $P_i(x) = \sum_{r \in R_i} p_r x^{v(\alpha_r)}$ are nonnegative, i.e.,
$p_r \ge 0$ for all $r$. Furthermore, (II): the coefficients sum to 1, i.e., $\sum_{r \in R_i} p_r = 1$. We call $x = P(x)$
a probabilistic polynomial system of equations (PPS) if it has properties (I) and (II) except
that for convenience we weaken (II) and also allow (II'): $\sum_{r \in R_i} p_r \le 1$. If a system of equations
$x = P(x)$ only satisfies (I), then we call it a monotone polynomial system of equations (MPS).

For any PPS, $x = P(x)$, $P(x)$ defines a monotone operator $P : [0,1]^n \to [0,1]^n$, i.e., if $y \ge x \ge 0$
then $P(y) \ge P(x)$. For any BP with corresponding PPS $x = P(x)$, $q^*$ is precisely the least fixed
point (LFP) of the monotone operator $P : [0,1]^n \to [0,1]^n$ (see [13]). A MPS, $x = P(x)$, also
defines a monotone operator $P : \mathbb{R}^n_{\ge 0} \to \mathbb{R}^n_{\ge 0}$ on the non-negative orthant $\mathbb{R}^n_{\ge 0}$. An MPS need
not in general have any solution in $\mathbb{R}^n_{\ge 0}$, but when it does so, it has a least fixed point solution
$q^* = P(q^*)$ such that $0 \le q' = P(q')$ implies $q^* \le q'$.

Note that any PPS (with rational coefficients) can be obtained as the system of equations 
x = P(x) for a corresponding BP G (with rational rule probabilities), and vice versa.² Thus, the
computational problem of computing the extinction probabilities of a given BP is equivalent to the 
problem of computing the least fixed point (LFP) solution q* of a given PPS, x = P(x). For a PPS 
x = P(x), we shall use |P| to denote the sum of the number n of variables and the numbers of bits 
of all the nonzero coefficients and nonzero exponents of all the polynomials in the PPS. Note that 
the encoding length of a PPS in sparse representation is at least $|P|$ (and at most $O(|P| \log n)$).

The probabilities $q^*$ can in general be irrational, and even deciding whether $q^*_i > 1/2$ is as hard
as longstanding open problems, including the square-root sum problem, which are not even known
to be in NP (see [13]). We instead want to approximate $q^*$ within a desired additive error $\epsilon > 0$.
In other words, we want to compute a rational vector $v' \in \mathbb{Q}^n \cap [0,1]^n$ such that $\|q^* - v'\|_\infty \le \epsilon$.

A PPS, $x = P(x)$, is said to be in Simple Normal Form (SNF) if for every $i = 1, \ldots, n$, the
polynomial $P_i(x)$ has one of two forms: (1) Form$_*$: $P_i(x) = x_j x_k$ is simply a quadratic monomial; or
(2) Form$_+$: $P_i(x)$ is a linear expression $\sum_{j \in C_i} p_{i,j} x_j + p_{i,0}$, for some rational non-negative coefficients
$p_{i,j}$ and $p_{i,0}$, and some index set $C_i \subseteq \{1, \ldots, n\}$, where $\sum_{j \in C_i \cup \{0\}} p_{i,j} \le 1$. We call such a linear
equation leaky if $\sum_{j \in C_i \cup \{0\}} p_{i,j} < 1$. An MPS is said to be in SNF if the same conditions hold
except we do not require $\sum_{j \in C_i \cup \{0\}} p_{i,j} \le 1$. The following is proved in the appendix.

Proposition 2.1 (cf. Proposition 7.3 [13]). Every PPS (MPS), $x = P(x)$, can be transformed in
P-time to an "equivalent" PPS (MPS, respectively), $y = Q(y)$ in SNF form, such that $|Q| \in O(|P|)$.
More precisely, the variables x are a subset of the variables y, and $y = Q(y)$ has a nonnegative real LFP $p^*$ iff
$x = P(x)$ has LFP $q^* \in \mathbb{R}^n_{\ge 0}$, and projecting $p^*$ onto the x variables yields $q^*$.

Proposition 2.2 ([13]). There is a P-time algorithm that, given a PPS, $x = P(x)$, over n variables,
with LFP $q^* \in \mathbb{R}^n_{\ge 0}$, determines for every $i = 1, \ldots, n$ whether $q^*_i = 0$ or $q^*_i = 1$ or $0 < q^*_i < 1$.

Thus, for every PPS, we can detect in P-time all the variables $x_i$ such that $q^*_i = 0$ or $q^*_i = 1$.
We can then remove these variables and their corresponding equation $x_i = P_i(x)$, and substitute
their values on the right hand sides (RHS) of the remaining equations. This yields a new PPS,
$x' = P'(x')$, where its LFP solution, $q'^*$, is $0 < q'^* < 1$, which corresponds to the remaining
coordinates of $q^*$.

We can thus henceforth assume, w.l.o.g., that any given PPS, $x = P(x)$, is in SNF
form and has an LFP solution $q^*$ such that $0 < q^* < 1$.

For a MPS or PPS, $x = P(x)$, its variable dependency graph is defined to be the digraph
$H = (V, E)$, with vertices $V = \{x_1, \ldots, x_n\}$, such that $(x_i, x_j) \in E$ iff in $P_i(x) = \sum_{r \in R_i} p_r x^{v(\alpha_r)}$
there is a coefficient $p_r > 0$ such that $v(\alpha_r)_j > 0$. Intuitively, $(x_i, x_j) \in E$ means that $x_i$ "depends
directly" on $x_j$. A MPS or PPS, $x = P(x)$, is called strongly connected if its dependency graph
$H$ is strongly connected. As in [13], for analysing PPSs we will find it very useful to decompose
the PPS based on the strongly connected components (SCCs) of its variable dependency graph.
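
As an illustration of this definition, the following sketch builds the dependency graph of a small system and computes its SCCs with Kosaraju's algorithm. The polynomial encoding (a list of `(coefficient, {variable: exponent})` monomials per equation), the function names, and the example system are assumptions made for this sketch.

```python
from fractions import Fraction

def dependency_graph(pps):
    """Edge i -> j whenever x_j occurs with positive exponent in a
    monomial of P_i(x) that has a positive coefficient."""
    graph = {i: set() for i in range(len(pps))}
    for i, poly in enumerate(pps):
        for coef, exps in poly:
            if coef > 0:
                graph[i].update(j for j, e in exps.items() if e > 0)
    return graph

def strongly_connected_components(graph):
    """Kosaraju's algorithm (recursive; adequate for small examples)."""
    order, seen = [], set()
    def visit(u):
        seen.add(u)
        for v in graph[u]:
            if v not in seen:
                visit(v)
        order.append(u)          # record u after all its descendants
    for u in graph:
        if u not in seen:
            visit(u)
    reverse = {u: set() for u in graph}
    for u in graph:
        for v in graph[u]:
            reverse[v].add(u)
    components, assigned = [], set()
    def collect(u, comp):
        assigned.add(u)
        comp.add(u)
        for v in reverse[u]:
            if v not in assigned:
                collect(v, comp)
    for u in reversed(order):    # decreasing finishing time
        if u not in assigned:
            comp = set()
            collect(u, comp)
            components.append(comp)
    return components

F = Fraction
snf_system = [
    [(F(1, 2), {1: 1}), (F(1, 2), {})],   # x_0 = (1/2) x_1 + 1/2
    [(F(3, 5), {2: 1}), (F(2, 5), {})],   # x_1 = (3/5) x_2 + 2/5
    [(F(1), {1: 2})],                     # x_2 = x_1 * x_1
]
sccs = strongly_connected_components(dependency_graph(snf_system))
print(sccs, "strongly connected:", len(sccs) == 1)
```

In this example $x_1$ and $x_2$ depend on each other while $x_0$ only depends on $x_1$, so the system is not strongly connected.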

² "Leaky" PPSs where $\sum_{r \in R_i} p_r < 1$ can be translated easily to BPs by adding an extra dummy type $S_{n+1}$, with
rule $S_{n+1} \xrightarrow{1} \{S_{n+1}, S_{n+1}\}$, so $q^*_{n+1} = 0$, and adding to $R_i$, for each "leaky" $i$, the rule $S_i \xrightarrow{p'_i} \{S_{n+1}, S_{n+1}\}$ with
probability $p'_i := (1 - \sum_{r \in R_i} p_r)$. The probabilities $q^*$ for the BP (ignoring $q^*_{n+1} = 0$) give the LFP of the PPS.

3 Polynomial upper bounds for Newton's method on PPSs



To find a solution for a differentiable system of equations $F(x) = 0$, in n variables, Newton's method
uses the following iteration scheme: start with some initial vector $x^{(0)} \in \mathbb{R}^n$, and for $k \ge 0$ let:
$x^{(k+1)} := x^{(k)} - F'(x^{(k)})^{-1} F(x^{(k)})$, where $F'(x)$ is the Jacobian matrix of $F(x)$.

Let $x = P(x)$ be a given PPS (or MPS) in n variables. Let $B(x) := P'(x)$ denote the Jacobian
matrix of $P(x)$. In other words, $B(x)$ is an $n \times n$ matrix such that $B(x)_{i,j} = \frac{\partial P_i(x)}{\partial x_j}$. Using Newton
iteration, starting at n-vector $x^{(0)} := 0$, yields the following iteration:

$$x^{(k+1)} := x^{(k)} + (I - B(x^{(k)}))^{-1} (P(x^{(k)}) - x^{(k)}) \qquad (1)$$

For a vector $z \in \mathbb{R}^n$, assuming that matrix $(I - B(z))$ is non-singular, we define a single iteration
of Newton's method for $x = P(x)$ on $z$ via the following operator:

$$\mathcal{N}_P(z) := z + (I - B(z))^{-1} (P(z) - z) \qquad (2)$$

It was shown in [13] that for any MPS, $x = P(x)$, with LFP $q^* \in \mathbb{R}^n_{\ge 0}$, if we first find and
remove the variables that have value 0 in the LFP, $q^*$, and apply a decomposed variant of Newton's
method that decomposes the system according to the strongly connected components (SCCs) of
the dependency graph and processes them bottom-up, then the values converge monotonically to
$q^*$. PPSs are a special case of MPSs, so the same applies to PPSs. In [10], it was pointed out that
if $q^* > 0$, i.e., after we remove the variables $x_i$ where $q^*_i = 0$, decomposition into SCCs isn't strictly
necessary. (Decomposition is nevertheless very useful in practice, as well as in the theoretical
analysis, including in this paper.) Thus:

Proposition 3.1 (cf. Theorem 6.1 of [13] and Theorem 4.1 of [10]). Let $x = P(x)$ be a MPS, with
LFP $q^* > 0$. Then starting at $x^{(0)} := 0$, the Newton iterations $x^{(k+1)} := \mathcal{N}_P(x^{(k)})$ are well defined
and monotonically converge to $q^*$, i.e. $\lim_{k \to \infty} x^{(k)} = q^*$, and $x^{(k+1)} \ge x^{(k)} \ge 0$ for all $k \ge 0$.

We will actually establish an extension of this result in this paper, because in Section 4 we will
need to show that even when each iterate is suitably rounded off, the rounded Newton iterations
are all well-defined and converge to $q^*$. The main goal of this section is to show that for PPSs,
$x = P(x)$, with LFP $0 < q^* < 1$, polynomially many iterations of Newton's method, using exact
rational arithmetic, suffice, as a function of $|P|$ and $j$, to compute $q^*$ to within additive error $1/2^{j}$.
In fact, we show a much stronger linear upper bound with small explicit constants:

Theorem 3.2 (Main Theorem of Section 3). Let $x = P(x)$ be any PPS in SNF form, with
LFP $q^*$, such that $0 < q^* < 1$. If we start Newton iteration at $x^{(0)} := 0$, with $x^{(k+1)} := \mathcal{N}_P(x^{(k)})$,
then for any integer $j > 0$ the following inequality holds: $\|q^* - x^{(j + 4|P|)}\|_\infty \le 2^{-j}$.

We need a sequence of Lemmas. The next two Lemmas are proved in the appendix. 

Lemma 3.3. Let $x = P(x)$ be a MPS, with n variables, in SNF form, and let $a, b \in \mathbb{R}^n$. Then:

$$P(a) - P(b) = B\left(\frac{a+b}{2}\right)(a - b) = \frac{B(a) + B(b)}{2}(a - b)$$

Lemma 3.4. Let $x = P(x)$ be a MPS in SNF form. Let $z \in \mathbb{R}^n$ be any vector such that $(I - B(z))$
is non-singular, and thus $\mathcal{N}_P(z)$ is defined. Then:

$$q^* - \mathcal{N}_P(z) = (I - B(z))^{-1} \, \frac{B(q^*) - B(z)}{2} \, (q^* - z)$$

To prove their exponential upper bounds for strongly connected PPSs, [10] used the notion of
a cone vector for the matrix $B(q^*)$, that is a vector $d > 0$ such that $B(q^*)d \le d$. For a strongly
connected MPS, $x = P(x)$, with $q^* > 0$, the matrix $B(q^*) \ge 0$ is irreducible, and thus has a positive
eigenvector. They used this eigenvector as their cone vector $d > 0$. However, such an eigenvector
yields only weak (exponential) bounds. Instead, we show there is a different cone vector for $B(q^*)$,
and even for $B(\frac{1}{2}(1 + q^*))$, that works for arbitrary (not necessarily strongly-connected) PPSs:

Lemma 3.5. If $x = P(x)$ is a PPS in n variables, in SNF form, with LFP $0 < q^* < 1$, and where
$P(x)$ has Jacobian $B(x)$, then $\forall z \in \mathbb{R}^n$ such that $0 \le z \le \frac{1}{2}(1 + q^*)$: $B(z)(1 - q^*) \le (1 - q^*)$.
In particular, $B(\frac{1}{2}(1 + q^*))(1 - q^*) \le (1 - q^*)$, and $B(q^*)(1 - q^*) \le (1 - q^*)$.

Proof. Lemma 3.3 applied to $1$ and $q^*$ gives: $P(1) - P(q^*) = P(1) - q^* = B(\frac{1}{2}(1 + q^*))(1 - q^*)$.
But note that $P(1) \le 1$, because for any PPS, since the nonnegative coefficients of each polynomial
$P_i(x)$ sum to $\le 1$, $P(x)$ maps $[0,1]^n$ to $[0,1]^n$. Thus $1 - q^* \ge P(1) - q^* = B(\frac{1}{2}(1 + q^*))(1 - q^*)$.
Now observe that for $0 \le z \le \frac{1}{2}(1 + q^*)$, $B(\frac{1}{2}(1 + q^*)) \ge B(z) \ge 0$, because the entries of Jacobian
$B(x)$ have nonnegative coefficients. Thus since $(1 - q^*) \ge 0$, we have $(1 - q^*) \ge B(z)(1 - q^*)$. □
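
The inequality of Lemma 3.5 can be checked by hand on a small instance; the sketch below does so in exact arithmetic. The SNF system $x_0 = \frac{1}{2}x_1 + \frac{1}{2}$, $x_1 = \frac{3}{5}x_2 + \frac{2}{5}$, $x_2 = x_1 x_1$, with LFP $q^* = (5/6, 2/3, 4/9)$, and the helper names are assumptions chosen for illustration only.

```python
from fractions import Fraction as F

q_star = [F(5, 6), F(2, 3), F(4, 9)]   # LFP of the illustrative SNF system

def jacobian(z):
    """B(z) for x_0 = x_1/2 + 1/2, x_1 = (3/5) x_2 + 2/5, x_2 = x_1 * x_1."""
    return [
        [F(0), F(1, 2), F(0)],
        [F(0), F(0), F(3, 5)],
        [F(0), 2 * z[1], F(0)],
    ]

def times(matrix, vector):
    """Matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

d = [1 - q for q in q_star]               # candidate cone vector 1 - q*
z_mid = [(1 + q) / 2 for q in q_star]     # the point (1 + q*)/2

for z in (q_star, z_mid):
    image = times(jacobian(z), d)
    # Coordinate-wise check of B(z)(1 - q*) <= (1 - q*).
    print(all(lhs <= rhs for lhs, rhs in zip(image, d)))
```

In this particular instance no equation is leaky, so $P(1) = 1$ and the inequality at $z = \frac{1}{2}(1 + q^*)$ holds with equality in every coordinate.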

For a square matrix A, let p(A) denote the spectral radius of A. 

Theorem 3.6. For any PPS, $x = P(x)$, in SNF form, if we have $0 < q^* < 1$, then for all
$0 \le z \le q^*$, $\rho(B(z)) < 1$ and $(I - B(z))^{-1}$ exists and is nonnegative.

The proof (in the appendix) uses, among other things, Lemma 3.5. Note that this theorem tells
us, in particular, that for every $z$ (including $q^*$), such that $0 \le z \le q^*$, the Newton iteration $\mathcal{N}_P(z)$
is well-defined. This will be important in Section 4. We need the following Lemma from [10]. (To
be self-contained, and to clarify our assumptions, we provide a short proof in the appendix.)

Lemma 3.7 (Lemma 5.4 from [10]). Let $x = P(x)$ be a MPS, with polynomials of degree bounded
by 2, with LFP, $q^* \ge 0$. Let $B(x)$ denote the Jacobian matrix of $P(x)$. For any positive vector
$d \in \mathbb{R}^n_{> 0}$ that satisfies $B(q^*)d \le d$, any positive real value $\lambda > 0$, and any nonnegative vector
$z \in \mathbb{R}^n_{\ge 0}$, if $q^* - z \le \lambda d$, and $(I - B(z))^{-1}$ exists and is nonnegative, then $q^* - \mathcal{N}_P(z) \le \frac{\lambda}{2} d$.

For a vector $b \in \mathbb{R}^n$, we shall use the following notation: $b_{\min} = \min_i b_i$, and $b_{\max} = \max_i b_i$.

Corollary 3.8. Let $x = P(x)$ be a MPS, with LFP $q^* > 0$, and let $B(x)$ be the Jacobian matrix for
$P(x)$. Suppose there is a vector $d \in \mathbb{R}^n$, $0 < d \le 1$, such that $B(q^*)d \le d$. For any positive integer
$j > 0$, if we perform Newton's method starting at $x^{(0)} := 0$, then: $\|q^* - x^{(j - \lfloor \log_2 d_{\min} \rfloor)}\|_\infty \le 2^{-j}$.

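Proof. By induction on $k$, we show $q^* - x^{(k)} \le 2^{-k} \frac{1}{d_{\min}} d$. For the base case, $k = 0$, since $d > 0$,
$\frac{1}{d_{\min}} d \ge 1 \ge q^* = q^* - x^{(0)}$. For $k > 0$, apply Lemma 3.7, setting $z := x^{(k-1)}$, $\lambda := 2^{-(k-1)} \frac{1}{d_{\min}}$,
and $d := d$. This yields $q^* - x^{(k)} \le \frac{\lambda}{2} d = 2^{-k} \frac{1}{d_{\min}} d$. Since we assume $\|d\|_\infty \le 1$, we have
$\|2^{-(j - \lfloor \log_2 d_{\min} \rfloor)} \frac{1}{d_{\min}} d\|_\infty \le 2^{-j}$, and thus $\|q^* - x^{(j - \lfloor \log_2 d_{\min} \rfloor)}\|_\infty \le 2^{-j}$. □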

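Lemma 3.9. For a PPS in SNF form, with LFP $q^*$, where $0 < q^* < 1$, if we start Newton iteration
at $x^{(0)} := 0$, then: $\|q^* - x^{(j + \lceil \log_2 \frac{(1 - q^*)_{\max}}{(1 - q^*)_{\min}} \rceil)}\|_\infty \le 2^{-j}$.

Proof. For $d := \frac{1 - q^*}{\|1 - q^*\|_\infty}$, $d_{\min} = \frac{(1 - q^*)_{\min}}{(1 - q^*)_{\max}}$. By Lemma 3.5, $B(q^*)d \le d$. Apply Corollary 3.8. □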

Lemma 3.10. For a strongly connected PPS, $x = P(x)$, with LFP $q^*$, where $0 < q^* < 1$, for any
two coordinates $k, l$ of $1 - q^*$:

$$\frac{(1 - q^*)_k}{(1 - q^*)_l} \ge 2^{-(2|P|)}$$

Proof. Lemma 3.5 says that $B(\frac{1}{2}(1 + q^*))(1 - q^*) \le (1 - q^*)$. Since every entry of the vector $\frac{1}{2}(1 + q^*)$
is $\ge 1/2$, every non-zero entry of the matrix $B(\frac{1}{2}(1 + q^*))$ is at least 1/2 times a coefficient of some
monomial in some polynomial $P_i(x)$ of $P(x)$. Moreover, $B(\frac{1}{2}(1 + q^*))$ is irreducible. Calling the
entries of $B(\frac{1}{2}(1 + q^*))$, $b_{i,j}$, we have a sequence of distinct indices, $i_1, i_2, \ldots, i_m$, with $l = i_1$, $k = i_m$,
$m \le n$, where each $b_{i_j i_{j+1}} > 0$. (Just take the "shortest positive path" from $l$ to $k$.) For any $j$:

$$(B(\tfrac{1}{2}(1 + q^*))(1 - q^*))_{i_{j+1}} \ge b_{i_j i_{j+1}} (1 - q^*)_{i_j}$$

Using Lemma 3.5 again, $(1 - q^*)_{i_{j+1}} \ge b_{i_j i_{j+1}} (1 - q^*)_{i_j}$. By simple induction: $(1 - q^*)_k \ge
(\prod_{j=1}^{m-1} b_{i_j i_{j+1}})(1 - q^*)_l$. Note that $|P|$ includes the encoding size of each positive coefficient of every
polynomial $P_i(x)$. We argued before that each $b_{i_j i_{j+1}} \ge c_i/2$ for some coefficient $c_i > 0$ of some
monomial in $P_i(x)$. Therefore, since each such $c_i$ is a distinct coefficient that is accounted for in $|P|$,
we must have $\prod_{j=1}^{m-1} b_{i_j i_{j+1}} \ge 2^{-(|P| + n)} \ge 2^{-(2|P|)}$, and thus we have: $(1 - q^*)_k \ge 2^{-(2|P|)}(1 - q^*)_l$. □

Combining Lemma 3.9 with Lemma 3.10 establishes the following:

Theorem 3.11. For a strongly connected PPS, $x = P(x)$ in $n$ variables, in SNF form, with LFP
$q^*$, such that $0 < q^* < 1$, if we start Newton iteration at $x^{(0)} := 0$, then: $\|q^* - x^{(j + 2|P|)}\|_\infty \leq 2^{-j}$.

To get a polynomial upper bound on the number of iterations of Newton's method for general
PPSs, we can apply Lemma 3.9 combined with a Lemma in [14] (Lemma 7.2 of [14]), which
implies that for a PPS x = P(x) with n variables, in SNF form, with LFP q* , where q* < 1,
(1 — q*) m i n > l/2n2l p l c for some constant c. Instead, we prove the following much stronger result: 

Theorem 3.12. For a PPS, $x = P(x)$ in $n$ variables, in SNF form, with LFP $q^*$, such that
$0 < q^* < 1$, for all $i = 1, \ldots, n$: $1 - q^*_i \geq 2^{-4|P|}$. In other words, $\|q^*\|_\infty \leq 1 - 2^{-4|P|}$.

The proof of Theorem 3.12 is in the appendix. We thus get the Main Theorem of this section:

Proof of Theorem 3.2 (Main Theorem of Sec. 3). By Lemma 3.9, $\|q^* - x^{(j + \lceil \log_2 \frac{(1-q^*)_{\max}}{(1-q^*)_{\min}} \rceil)}\|_\infty \leq 2^{-j}$.
But by Theorem 3.12, $\lceil \log_2 \frac{(1-q^*)_{\max}}{(1-q^*)_{\min}} \rceil \leq \lceil \log_2 \frac{1}{(1-q^*)_{\min}} \rceil \leq \lceil \log_2 2^{4|P|} \rceil = 4|P|$. □

4 Polynomial time in the standard Turing model of computation 

The previous section showed that for a PPS, $x = P(x)$, using $(4|P| + j)$ iterations of Newton's
method starting at $x^{(0)} := 0$, we obtain $q^*$ within additive error $2^{-j}$. However, performing even $|P|$
iterations of Newton's method exactly may not be feasible in P-time in the Turing model, because
the encoding size of iterates $x^{(k)}$ can become very large. Specifically, by repeated squaring, the
rational numbers representing the iterate $x^{(|P|)}$ may require encoding size exponential in $|P|$.
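
The growth in encoding size can be observed on a tiny example. The following is only an illustrative sketch: the one-variable system x = (2/3)x^2 + 1/3 (with LFP 1/2) is our own assumed example, it is not in SNF form, and the helper names `newton_step_1d` and `denominator_bits` are hypothetical. The code runs exact Newton iteration from 0 using Python's `fractions.Fraction` and records the bit length of the denominator of each iterate.

```python
from fractions import Fraction

def newton_step_1d(x):
    """One exact Newton step N_P(x) = x + (P(x) - x) / (1 - P'(x))
    for the assumed example P(x) = (2/3) x^2 + 1/3."""
    p = Fraction(2, 3) * x * x + Fraction(1, 3)
    dp = Fraction(4, 3) * x
    return x + (p - x) / (1 - dp)

def denominator_bits(iterations):
    """Bit lengths of the denominators of the first `iterations` exact iterates."""
    x = Fraction(0)
    sizes = []
    for _ in range(iterations):
        x = newton_step_1d(x)
        sizes.append(x.denominator.bit_length())
    return sizes

if __name__ == "__main__":
    print(denominator_bits(6))
```

On this example the denominator at least doubles in length at every step, which is the repeated-squaring effect described above and the reason exact iteration is not obviously polynomial time.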

In this section, we show that we can nevertheless approximate in P-time the LFP $q^*$ of a PPS,
$x = P(x)$. We do so by showing that we can round down all coordinates of each Newton iterate $x^{(k)}$
to a suitable polynomial length, and still have a well-defined iteration that converges in nearly the
same number of iterations to $q^*$. Throughout this section we assume every PPS is in SNF form.

Definition 4.1. ("Rounded down Newton's method", with rounding parameter $h$.) Given a PPS,
$x = P(x)$, with LFP $q^*$, where $0 < q^* < 1$, in the "rounded down Newton's method" with integer
rounding parameter $h > 0$, we compute a sequence of iteration vectors $x^{[k]}$, where the initial starting
vector is again $x^{[0]} := 0$, and such that for each $k \geq 0$, given $x^{[k]}$, we compute $x^{[k+1]}$ as follows:

1. First, compute $x^{\{k+1\}} := \mathcal{N}_P(x^{[k]})$, where the Newton iteration operator $\mathcal{N}_P(x)$ was defined
in equation (2). (Of course we need to show that all such Newton iterations are defined.)
2. For each coordinate $i = 1, \ldots, n$, set $x_i^{[k+1]}$ to be equal to the maximum (non-negative) multiple
of $2^{-h}$ which is $\leq \max(x_i^{\{k+1\}}, 0)$. (In other words, round down $x^{\{k+1\}}$ to the nearest multiple
of $2^{-h}$, while making sure that the result is non-negative.)
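
The two steps of Definition 4.1 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the way a system is passed in (as callables `P` and `B` returning the polynomial values and the Jacobian), and the small SNF example system x1 = (2/3)x2 + 1/3, x2 = x1 x1 (with LFP (1/2, 1/4)) are all our own choices. Exact rational arithmetic is done with `fractions.Fraction`.

```python
from fractions import Fraction
from math import floor

def solve(A, b):
    """Solve A y = b exactly by Gauss-Jordan elimination over Fractions."""
    n = len(A)
    M = [[Fraction(v) for v in A[i]] + [Fraction(b[i])] for i in range(n)]
    for c in range(n):
        # pick a row with a non-zero pivot (StopIteration if A is singular)
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]

def newton_step(P, B, x):
    """Step 1: N_P(x) = x + (I - B(x))^{-1} (P(x) - x)."""
    n = len(x)
    Px, Bx = P(x), B(x)
    A = [[(1 if i == j else 0) - Bx[i][j] for j in range(n)] for i in range(n)]
    delta = solve(A, [Px[i] - x[i] for i in range(n)])
    return [x[i] + delta[i] for i in range(n)]

def round_down(v, h):
    """Step 2: largest non-negative multiple of 2^-h that is <= max(v, 0)."""
    scale = 2 ** h
    return Fraction(floor(max(v, Fraction(0)) * scale), scale)

def rounded_newton(P, B, n, h, iterations):
    """Rounded down Newton's method started at the zero vector."""
    x = [Fraction(0)] * n
    for _ in range(iterations):
        x = [round_down(v, h) for v in newton_step(P, B, x)]
    return x

# Assumed example in SNF form: x1 = (2/3) x2 + 1/3 (Form+), x2 = x1 * x1 (Form*).
def example_P(x):
    return [Fraction(2, 3) * x[1] + Fraction(1, 3), x[0] * x[0]]

def example_B(x):
    return [[Fraction(0), Fraction(2, 3)], [2 * x[0], Fraction(0)]]

if __name__ == "__main__":
    print(rounded_newton(example_P, example_B, n=2, h=20, iterations=20))
```

Every iterate is a multiple of 2^-h in [0, 1], so its encoding size stays linear in h regardless of how many iterations are run.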

Theorem 4.2 (Main Theorem of Section 4). Given a PPS, $x = P(x)$, with LFP $q^*$, such
that $0 < q^* < 1$, if we use the rounded down Newton's method with parameter $h = j + 2 + 4|P|$,
then the iterations are all defined, for every $k \geq 0$ we have $0 \leq x^{[k]} \leq q^*$, and furthermore after
$h = j + 2 + 4|P|$ iterations we have: $\|q^* - x^{[j+2+4|P|]}\|_\infty \leq 2^{-j}$.

We prove this via some lemmas. The next lemma proves that the iterations are always well-defined,
and yield vectors $x^{[k]}$ such that $0 \leq x^{[k]} \leq q^*$. Note however that, unlike Newton iteration
using exact arithmetic, we do not claim (as in Proposition 3.1) that $x^{[k]}$ converges monotonically
to $q^*$. It may not. It turns out we don't need this: all we need is that $0 \leq x^{[k]} \leq q^*$, for all
$k$. In particular, it may not hold that $P(x^{[k]}) \geq x^{[k]}$. For establishing the monotone convergence
of Newton's method on MPSs (Proposition 3.1), the fact that $P(x^{(k)}) \geq x^{(k)}$ is key (see [13]).
Indeed, note that for PPSs, once we know that $(P(x^{(k)}) - x^{(k)}) \geq 0$, Theorem 3.6 and the defining
equation of Newton iteration, (2), already prove monotone convergence: $x^{(k)}$ is well-defined and
$x^{(k+1)} \geq x^{(k)} \geq 0$, for all $k$. However, $P(x^{[k]}) \geq x^{[k]}$ may no longer hold after rounding down. If,
for instance, the polynomial $P_i(x)$ has degree 1 (i.e., has Form$_+$), then one can show that after
any positive number of iterations $k \geq 1$, we will have that $P_i(x^{(k)}) = x_i^{(k)}$. So, if we are unlucky,
rounding down each coordinate of $x^{\{k+1\}}$ to a multiple of $2^{-h}$ could indeed give $(P(x^{[k+1]}))_i < x_i^{[k+1]}$.

Lemma 4.3. If we run the rounded down Newton method starting with $x^{[0]} := 0$ on a PPS,
$x = P(x)$, with LFP $q^*$, $0 < q^* < 1$, then for all $k \geq 0$, $x^{[k]}$ is well-defined and $0 \leq x^{[k]} \leq q^*$.

The next key lemma shows that the rounded version still makes good progress towards the LFP. 

Lemma 4.4. For a PPS, $x = P(x)$, with LFP $q^*$, such that $0 < q^* < 1$, if we apply the rounded
down Newton's method with parameter $h$, starting at $x^{[0]} := 0$, then for all $j' \geq 0$, we have:

\[ \|q^* - x^{[j'+1]}\|_\infty \leq 2^{-j'} + 2^{-h+1+4|P|} \]

The proofs of Lemmas 4.3 and 4.4 use the results of the previous section and bound the effects
of the rounding. The proofs are given in the Appendix. We can then show the main theorem:

Proof of Theorem 4.2 (Main Theorem of Sec. 4). In Lemma 4.4 let $j' := j + 4|P| + 1$ and $h :=
j + 2 + 4|P|$. We have: $\|q^* - x^{[j+2+4|P|]}\|_\infty \leq 2^{-(j+1+4|P|)} + 2^{-(j+1)} \leq 2^{-(j+1)} + 2^{-(j+1)} = 2^{-j}$. □

Corollary 4.5. Given any PPS, $x = P(x)$, with LFP $q^*$, we can approximate $q^*$ within additive
error $2^{-j}$ in time polynomial in $|P|$ and $j$ (in the standard Turing model of computation). More
precisely, we can compute a vector $v$, $0 \leq v \leq q^*$, such that $\|q^* - v\|_\infty \leq 2^{-j}$.

5 Application to probabilistic parsing of general SCFGs 

We briefly describe an application to some important problems for stochastic context-free grammars
(SCFGs). For definitions, background, and a detailed treatment we refer to Appendix B.

We are given a SCFG $G = (V, \Sigma, R, S)$ with set $V$ of nonterminals, set $\Sigma$ of terminals, set
$R$ of probabilistic rules with rational probabilities, and start symbol $S \in V$. The SCFG induces
probabilities on terminal strings, where the probability $p_{G,w}$ of string $w \in \Sigma^*$ is the probability
that $G$ derives $w$. A basic computational problem is: Given a SCFG $G$ and string $w$, compute
the probability $p_{G,w}$ of $w$. This probability is generally irrational, thus we want to compute it
approximately to desired accuracy $\delta > 0$. We give the first polynomial-time algorithm for this
problem that works for arbitrary SCFGs.

Theorem 5.1. There is a polynomial-time algorithm that, given as input a SCFG $G$, a string
$w$ and a rational $\delta > 0$ in binary representation, approximates the probability $p_{G,w}$ within $\delta$, i.e.,
computes a value $v$ such that $|v - p_{G,w}| < \delta$.

The heart of the algorithm involves the transformation of the given SCFG $G$ to another
"approximately equivalent" SCFG $G'$ with rational rule probabilities that is in Chomsky Normal Form
(CNF). More precisely, for every SCFG $G$ there is a SCFG $G''$ in CNF that is equivalent to $G$ in the
sense that it gives the same probability to all the strings of $\Sigma^*$. However, it may be the case that
any such grammar must have irrational probabilities, and thus cannot be computed explicitly. Our
algorithm computes a CNF SCFG grammar $G'$ that has the same structure (i.e. rules) as such an
equivalent CNF grammar $G''$ and has rational rule probabilities that approximate the probabilities
of $G''$ to sufficient accuracy $\delta$ (we say that $G'$ $\delta$-approximates $G''$) such that $p_{G',w}$ provides the
desired approximation to $p_{G,w}$, and in fact this holds for all strings up to any given length $N$.

Theorem 5.2. There is a polynomial-time algorithm that, given a SCFG $G$, a natural number $N$
in unary, and a rational $\delta > 0$ in binary, computes a new SCFG $G'$ in CNF that $\delta$-approximates
a SCFG in CNF that is equivalent to $G$, and furthermore $|p_{G,w} - p_{G',w}| < \delta$ for all strings $w$ of
length at most $N$.

The algorithm for Theorem 5.2 involves a series of transformations. There are two complex
steps in the series. The first is elimination of $\epsilon$-rules. This requires introduction of irrational rule
probabilities if we were to preserve equivalence, and thus can only be done approximately. We
effectively show that computation of the desired rule probabilities in this transformation can be reduced
to computation of termination probabilities for certain auxiliary grammars; furthermore, the
structure of the construction has the property that the reduction essentially preserves approximations.
The second complicated step is the elimination of unary rules. This requires the solution of certain
linear systems whose coefficients are irrational (hence can only be approximately computed) and
furthermore some of them may be extremely (doubly exponentially) small, which could potentially
cause the system to be very ill-conditioned. We show that fortunately this does not happen, by a
careful analysis of the structure of the constructed grammar and the associated system. The details
are quite involved and are given in the appendix.

Once we have an approximately equivalent CNF SCFG $G'$ we can compute $p_{G',w}$ using a
well-known variant of the CKY parsing algorithm, which runs in polynomial time in the unit-cost RAM
model, but the numbers may become exponentially long. We show that we can do the computation
approximately with sufficient accuracy to obtain a good approximation of the desired probability
$p_{G,w}$ in P-time in the Turing model, and thus prove Theorem 5.1.
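
For orientation, the CKY-style dynamic program for a SCFG that is already in CNF (often called the inside algorithm) can be sketched as follows. This is the standard textbook procedure with exact rational arithmetic, not the approximate Turing-model variant developed in the paper; the rule encoding as dictionaries, the function name `string_probability`, and the one-nonterminal example grammar are assumptions made for illustration.

```python
from fractions import Fraction
from collections import defaultdict

def string_probability(binary_rules, terminal_rules, start, w):
    """Inside (CKY-style) dynamic program for a SCFG in CNF.

    binary_rules:   dict mapping (A, B, C) -> probability of rule A -> B C
    terminal_rules: dict mapping (A, a)    -> probability of rule A -> a
    Probabilities should be Fractions. Returns the probability that
    `start` derives the non-empty string w.
    """
    n = len(w)
    # inside[(i, j)][A] = probability that nonterminal A derives w[i:j]
    inside = defaultdict(lambda: defaultdict(Fraction))
    for i, a in enumerate(w):
        for (A, t), p in terminal_rules.items():
            if t == a:
                inside[(i, i + 1)][A] += p
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # split point between the two children
                for (A, B, C), p in binary_rules.items():
                    left = inside[(i, k)].get(B)
                    right = inside[(k, j)].get(C)
                    if left and right:
                        inside[(i, j)][A] += p * left * right
    return inside[(0, n)].get(start, Fraction(0))

if __name__ == "__main__":
    # Assumed toy grammar: S -> S S with probability 1/3, S -> a with probability 2/3.
    binary = {("S", "S", "S"): Fraction(1, 3)}
    terminal = {("S", "a"): Fraction(2, 3)}
    print(string_probability(binary, terminal, "S", "aaa"))
```

The sketch performs a number of arithmetic operations that is cubic in the length of w, which matches the unit-cost RAM bound mentioned above; the exact fractions it manipulates are where the long numbers come from.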

We note that Chomsky Normal Form is used as the starting point in the literature for several
other important problems concerning SCFGs. We expect that Theorem 5.2 and the techniques
we developed in this paper will enable the development of efficient P-time algorithms for these
problems that work for arbitrary SCFGs.

A Appendix A: Missing proofs in Sections 2-4
A.1 Proof of Proposition 2.1

Proposition 2.1 [cf. also Proposition 7.3 [13]]. Every PPS (MPS), $x = P(x)$, can be transformed in
P-time to an "equivalent" PPS (MPS, respectively), $y = Q(y)$ in SNF form, such that $|Q| \in O(|P|)$.
More precisely, the variables $x$ are a subset of the variables $y$, and $y = Q(y)$ has LFP $p^* \in \mathbb{R}^m_{\geq 0}$ iff
$x = P(x)$ has LFP $q^* \in \mathbb{R}^n_{\geq 0}$, and projecting $p^*$ onto the $x$ variables yields $q^*$.

Proof. We prove we can convert any PPS (MPS), $x = P(x)$, to SNF form by adding new auxiliary
variables, obtaining a different system of polynomial equations $y = Q(y)$ with $|Q|$ linear in $|P|$.

To do this, we simply observe that we can use repeated squaring and Horner's rule to express
any monomial $x^{\alpha}$ via a circuit (straight-line program) with gates $*$, and with the variables $x_i$ as
input. Such a circuit will have size $O(m)$ where $m$ is the sum of the numbers of bits of the positive
elements in the vector $\alpha$ of exponents. We can then convert such a circuit to a system of equations,
by simply replacing the original monomial $x^{\alpha}$ by a new variable $y$, and by simply using auxiliary
variables in place of the gates of the circuit to "compute" the monomial $x^{\alpha}$ that the variable $y$
should be equal to.
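
The repeated-squaring construction can be illustrated for a single power of one variable. The sketch below is our own illustration under stated assumptions: the function names `power_equations` and `evaluate`, the gate names `y1, y2, ...`, and the triple encoding `(new_variable, left, right)` for a product equation are hypothetical, and a full conversion would also have to multiply the powers of different variables together.

```python
def power_equations(var, e):
    """Express var**e (e >= 1) with product gates only, via repeated squaring.

    Returns (result_name, equations), where each equation (y, a, b) stands for
    the Form* equation y = a * b over previously defined names.
    """
    equations = []

    def gate(a, b):
        name = f"y{len(equations) + 1}"
        equations.append((name, a, b))
        return name

    result = None
    square = var  # holds var**(2**t) after t squarings
    while e > 0:
        if e & 1:
            result = square if result is None else gate(result, square)
        e >>= 1
        if e > 0:
            square = gate(square, square)
    return result, equations

def evaluate(var, value, result, equations):
    """Evaluate the straight-line program at var = value."""
    env = {var: value}
    for name, a, b in equations:
        env[name] = env[a] * env[b]
    return env[result]

if __name__ == "__main__":
    result, eqs = power_equations("x", 13)
    for name, a, b in eqs:
        print(f"{name} = {a} * {b}")
    print(evaluate("x", 3, result, eqs) == 3 ** 13)
```

The number of gates produced is at most twice the number of bits of the exponent, in line with the $O(m)$ circuit size stated above.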

Note that by doing this every monomial on the RHS of any of the original equations $x_i =
P_i(x)$ will have been replaced by a single variable, and thus those original equations will now
become Form$_+$ linear equations, and note that all internal gates of the circuit for representing
$x^{\alpha}$, represented by a variable $y_i$, give simply the product of two other variables, and thus their
corresponding equations are simply of the form $y_i = y_j y_k$, which constitutes a Form$_*$ equation.

Importantly, note that the system of equations so obtained will still remain a system of
monotone (and respectively, probabilistic) polynomial equations, if the original system was monotone
(respectively, probabilistic), because each new auxiliary variable $y_i$ that we introduce (which acts
as a gate in the circuit for the monomial $x^{\alpha}$) will be associated with an equation of the form
$y_i = y_j y_k$, which indeed is both a monotone and probabilistic equation.

Furthermore, the new system of equations $y = Q(y)$ has the property that (a) any solution
$p'' \in \mathbb{R}^m_{\geq 0}$ of $y = Q(y)$, when projected on to the $x$ variables, yields a solution $p' \in \mathbb{R}^n_{\geq 0}$ to the
original system of equations, $x = P(x)$, and (b) any solution $q' \in \mathbb{R}^n_{\geq 0}$ to the original system of
equations $x = P(x)$ yields a unique solution $q''$ to the expanded system of equations, $y = Q(y)$,
by uniquely solving for the values of the new auxiliary variables using their equations (which are
derived from the arithmetic circuit).

The $O(|P|)$ bound that is claimed for $|Q|$ follows easily from the fact that the circuit representing
each monomial $x^{\alpha}$ has size $O(m)$, where $m$ is the sum of the numbers of bits of the positive elements
in the vector $\alpha$. □

A.2 Proof of Lemma 3.3

Lemma 3.3. Let $x = P(x)$ be a MPS, with $n$ variables, in SNF form, and let $a, b \in \mathbb{R}^n$. Then:
\[ P(a) - P(b) = B\left(\frac{a+b}{2}\right)(a - b) = \frac{B(a) + B(b)}{2}(a - b) \]

Proof. Let the function $f : \mathbb{R} \to \mathbb{R}^n$ be given by $f(t) := ta + (1-t)b = b + t(a-b)$. Define
$G(t) := P(f(t))$.

From the fundamental theorem of calculus, and using the matrix form of the chain rule from
multi-variable calculus (see, e.g., [3] Section 12.10), we have:

\[ P(a) - P(b) = G(1) - G(0) = \int_0^1 B(f(t))(a-b)\,dt \]

By linearity, we can just take out $(a - b)$ from the integral as a constant, and we get:

\[ P(a) - P(b) = \left(\int_0^1 B(ta + (1-t)b)\,dt\right)(a - b) \]

We need to show that

\[ \int_0^1 B(ta + (1-t)b)\,dt = B\left(\frac{a+b}{2}\right) = \frac{B(a) + B(b)}{2} \]

Since all monomials in $P(x)$ have degree at most 2, each entry of the Jacobian matrix $B(x)$ is a
polynomial of degree 1 over variables in $x$. For any integers $i, j$, with $1 \leq i \leq n$, $1 \leq j \leq n$, there
are thus real values $\alpha$ and $\beta$ with

\[ (B(ta + (1-t)b))_{ij} = \alpha + \beta t \]

Then

\[ \left(\int_0^1 B(ta + (1-t)b)\,dt\right)_{ij} = \int_0^1 (\alpha + \beta t)\,dt = \alpha + \frac{\beta}{2} = \left(B\left(\frac{a+b}{2}\right)\right)_{ij} = \left(\frac{B(a) + B(b)}{2}\right)_{ij} \]

□
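
The identity of Lemma 3.3 is easy to check numerically on a small system. The sketch below is only a sanity check under assumptions of our own: the two-variable SNF system x1 = (2/3)x2 + 1/3, x2 = x1 x1, the function names, and the test points are illustrative and do not come from the paper. Inputs are expected to be `Fraction` values so that all comparisons are exact.

```python
from fractions import Fraction

def P(x):
    """Assumed example system in SNF form: a Form+ and a Form* equation."""
    return [Fraction(2, 3) * x[1] + Fraction(1, 3), x[0] * x[0]]

def B(x):
    """Jacobian of P at x."""
    return [[Fraction(0), Fraction(2, 3)], [2 * x[0], Fraction(0)]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def check_midpoint_identity(a, b):
    """Check P(a) - P(b) = B((a+b)/2)(a-b) = ((B(a)+B(b))/2)(a-b) exactly."""
    diff = [ai - bi for ai, bi in zip(a, b)]
    lhs = [pa - pb for pa, pb in zip(P(a), P(b))]
    mid = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    via_midpoint = matvec(B(mid), diff)
    Ba, Bb = B(a), B(b)
    avg = [[(Ba[i][j] + Bb[i][j]) / 2 for j in range(2)] for i in range(2)]
    via_average = matvec(avg, diff)
    return lhs == via_midpoint == via_average

if __name__ == "__main__":
    a = [Fraction(3, 7), Fraction(-2, 5)]
    b = [Fraction(5, 9), Fraction(1, 4)]
    print(check_midpoint_identity(a, b))
```

As in the lemma, the points a and b are arbitrary real (here rational) vectors; nothing requires them to lie in [0, 1].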



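A.3 Proof of Lemma 3.4

Lemma 3.4. Let $x = P(x)$ be a MPS in SNF form. Let $z \in \mathbb{R}^n$ be any vector such that $(I - B(z))$
is non-singular, and thus $\mathcal{N}_P(z)$ is defined. Then$^3$:

\[ q^* - \mathcal{N}_P(z) = (I - B(z))^{-1}\,\frac{B(q^*) - B(z)}{2}\,(q^* - z) \]

Proof. Lemma 3.3 applied to $q^*$ and $z$ gives: $q^* - P(z) = \frac{B(q^*) + B(z)}{2}(q^* - z)$. Rearranging, we get:

\[ P(z) - z = \left(I - \frac{B(q^*) + B(z)}{2}\right)(q^* - z) \qquad (3) \]

Replacing $(P(z) - z)$ in equation (2) by the right hand side of equation (3) and subtracting both
sides of (2) from $q^*$, gives:

\[ q^* - \mathcal{N}_P(z) = (q^* - z) - (I - B(z))^{-1}\left(I - \frac{B(q^*) + B(z)}{2}\right)(q^* - z) \]
\[ = (I - B(z))^{-1}(I - B(z))(q^* - z) - (I - B(z))^{-1}\left(I - \frac{B(q^*) + B(z)}{2}\right)(q^* - z) \]
\[ = (I - B(z))^{-1}\left((I - B(z)) - \left(I - \frac{B(q^*) + B(z)}{2}\right)\right)(q^* - z) \]
\[ = (I - B(z))^{-1}\,\frac{B(q^*) - B(z)}{2}\,(q^* - z) \]

□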



A.4 Proof of Theorem 3.6

Theorem 3.6. For any PPS, $x = P(x)$, in SNF form, if we have $0 < q^* < 1$, then for all
$0 \leq z \leq q^*$, $\rho(B(z)) < 1$ and $(I - B(z))^{-1}$ exists and is nonnegative.

Proof. For any square matrix $A$, let $\rho(A)$ denote the spectral radius of $A$. We need the following
basic fact:

Lemma A.1 (see, e.g., [20]). If $A$ is a square matrix with $\rho(A) < 1$ then $(I - A)$ is non-singular,
the series $\sum_{k=0}^{\infty} A^k$ converges, and $(I - A)^{-1} = \sum_{k=0}^{\infty} A^k$.

$^3$ Our proof of Lemma 3.4 does not use the fact that $x = P(x)$ is a MPS. We only use the fact that $q^*$ is some solution to
$x = P(x)$, that $\mathcal{N}_P(z)$ is well-defined, and that $P(x)$ consists of polynomials of degree bounded by at most 2.

For all $0 \leq z \leq q^*$, $B(z)$ is a nonnegative matrix, and since the entries of the Jacobian matrix
$B(x)$ have nonnegative coefficients, $B(x)$ is monotone in $x$, i.e., if $0 \leq z \leq q^*$, then $0 \leq B(z) \leq
B(q^*)$, and thus by basic facts about non-negative matrices $\rho(B(z)) \leq \rho(B(q^*))$. Thus by Lemma
A.1 it suffices to establish that $\rho(B(q^*)) < 1$. We will first prove this for strongly connected PPSs:

Lemma A.2. For any strongly connected PPS, $x = P(x)$, in SNF form with LFP $q^*$, such that
$0 < q^* < 1$, we have $\rho(B(q^*)) < 1$.

Proof. If the Jacobian $B(x)$ is constant, then $B(q^*) = B(1) = B$. In this case, $B$ is actually an
irreducible substochastic matrix, and since we have removed all variables $x_i$ such that $q^*_i = 0$,
it is easy to see that some polynomial $P_i(x)$ must have contained a positive constant term, and
therefore, in the (constant) Jacobian matrix $B$ there is some row whose entries sum to $< 1$. Since
$B$ is also irreducible, we then clearly have that $\lim_{m \to \infty} B^m = 0$. But this is equivalent to saying
that $\rho(B) < 1$. Thus we can assume that the Jacobian $B(x)$ is non-constant. By Lemma 3.5:

\[ B(\tfrac{1}{2}(1+q^*))(1-q^*) \leq (1-q^*) \]

We have $1 - q^* > 0$, and $B(\frac{1}{2}(1+q^*)) \geq 0$. Thus, by induction, for any positive integer power $k$,
we have

\[ B(\tfrac{1}{2}(1+q^*))^k (1-q^*) \leq (1-q^*) \qquad (4) \]

Now, since $B(x)$ is non-constant, and $B(x)$ is monotone in $x$, and since $q^* < \frac{1}{2}(1+q^*)$, we have
$B(q^*) \leq B(\frac{1}{2}(1+q^*))$ and furthermore there is some entry $(i,j)$ such that $B(q^*)_{i,j} < B(\frac{1}{2}(1+q^*))_{i,j}$.
It follows that:

\[ (B(q^*)(1-q^*))_i < (B(\tfrac{1}{2}(1+q^*))(1-q^*))_i \leq (1-q^*)_i \]

Therefore, since $B(q^*)$ is irreducible, it follows that for any coordinate $r$ there exists a power $k \leq n$
such that $(B(q^*)^k(1-q^*))_r < (1-q^*)_r$. Therefore, $B(q^*)^n(1-q^*) < (1-q^*)$. Thus, there exists
some $0 < \beta < 1$, such that $B(q^*)^n(1-q^*) \leq \beta(1-q^*)$. Thus, by induction on $m$, for all $m \geq 1$,
we have $B(q^*)^{nm}(1-q^*) \leq \beta^m(1-q^*)$. But $\lim_{m \to \infty} \beta^m = 0$, and thus since $(1-q^*) > 0$, it must
be the case that $\lim_{m \to \infty} B(q^*)^{nm} = 0$ (in all coordinates). But this last statement is equivalent to
saying that $\rho(B(q^*)) < 1$. □

Now we can proceed to arbitrary PPSs. We want to show that $\rho(B(q^*)) < 1$. Consider an
eigenvector $v \in \mathbb{R}^n_{\geq 0}$, $v \neq 0$, of $B(q^*)$, associated with the eigenvalue $\rho(B(q^*))$, with $B(q^*)v =
\rho(B(q^*))v$. Such an eigenvector exists by a standard fact in Perron-Frobenius theory (see, e.g.,
Theorem 8.3.1 [20]).

Consider any subset $S \subseteq \{1, \ldots, n\}$ of variable indices, and let $x_S = P_S(x_S, x_{D_S})$ denote the
subsystem of $x = P(x)$ associated with the vector $x_S$ of variables in set $S$, where $x_{D_S}$ denotes the
variables not in $S$. Note that $x_S = P_S(x_S, q^*_{D_S})$ is itself a PPS. We call $S$ strongly connected if
$x_S = P_S(x_S, q^*_{D_S})$ is a strongly connected PPS.

By Lemma A.2, for any such strongly connected PPS given by indices $S$, if we define its Jacobian
by $B_S(x)$, then $\rho(B_S(q^*)) < 1$. If $S$ defines a bottom strongly connected component that depends
on no other components in the system $x = P(x)$, then we would have that $B_S(q^*)v_S = \rho(B(q^*))v_S$
where $v_S$ is the subvector of $v$ with coordinates in $S$. Unfortunately $v_S$ might in general be the zero
vector. However, if we take $S$ to be a strongly connected component that has $v_S \neq 0$ and such that
the SCC $S$ only depends on SCCs $S'$ with $v_{S'} = 0$, then we still have $B_S(q^*)v_S = \rho(B(q^*))v_S$. Thus,
by another standard fact from Perron-Frobenius theory (see Theorem 8.3.2 of [20]), $\rho(B_S(q^*)) \geq
\rho(B(q^*))$. But since $\rho(B_S(q^*)) < 1$, this implies $\rho(B(q^*)) < 1$. □

A.5 Proof of Lemma 3.7

Lemma 3.7 [Lemma 5.4 from [10]]. Let $x = P(x)$ be a MPS, with polynomials of degree bounded
by 2, with LFP $q^* > 0$. Let $B(x)$ denote the Jacobian matrix of $P(x)$. For any positive vector
$d \in \mathbb{R}^n_{>0}$ that satisfies $B(q^*)d \leq d$, any positive real value $\lambda > 0$, and any nonnegative vector
$z \in \mathbb{R}^n_{\geq 0}$, if $q^* - z \leq \lambda d$, and $(I - B(z))^{-1}$ exists and is nonnegative, then $q^* - \mathcal{N}_P(z) \leq \frac{\lambda}{2} d$.

Proof. By Lemma 3.4, $q^* - \mathcal{N}_P(z) = (I - B(z))^{-1} \frac{1}{2}(B(q^*) - B(z))(q^* - z)$. Note that the matrix
$(I - B(z))^{-1} \frac{1}{2}(B(q^*) - B(z))$ is nonnegative: we assumed $(I - B(z))^{-1} \geq 0$ and the positive
coefficients in $P(x)$ and in $B(x)$ mean $(B(q^*) - B(z)) \geq 0$. This and the assumption that $q^* - z \leq \lambda d$
yields: $q^* - \mathcal{N}_P(z) \leq (I - B(z))^{-1} \frac{1}{2}(B(q^*) - B(z)) \lambda d$. We can rearrange as follows:

\[ q^* - \mathcal{N}_P(z) \leq (I - B(z))^{-1}\,\tfrac{1}{2}(B(q^*) - B(z))\,\lambda d \]
\[ = (I - B(z))^{-1}\,\tfrac{1}{2}((I - B(z)) - (I - B(q^*)))\,\lambda d \]
\[ = \tfrac{\lambda}{2}(I - (I - B(z))^{-1}(I - B(q^*)))\,d \]
\[ = \tfrac{\lambda}{2} d - \tfrac{\lambda}{2}(I - B(z))^{-1}(I - B(q^*))\,d \]

If we can show that $\frac{\lambda}{2}(I - B(z))^{-1}(I - B(q^*))d \geq 0$, we are done. By assumption: $(I - B(q^*))d \geq 0$,
and since we assumed $(I - B(z))^{-1} \geq 0$ and $\lambda > 0$, we have: $\frac{\lambda}{2}(I - B(z))^{-1}(I - B(q^*))d \geq 0$. □

A.6 Proof of Theorem 3.12

Recall again that we assume that the PPS, $x = P(x)$, is in SNF form, where each equation
$x_i = P_i(x)$ is either of the form $x_i = x_j x_k$ (Form$_*$), or of the form $x_i = \sum_j p_{i,j} x_j + p_{i,0}$ (Form$_+$).
There is one equation for each variable. If $n$ is the number of variables, we can assume w.l.o.g. that $|P| \geq 3n$
(i.e. the input has at least 3 bits per variable).

We know that the ratio of largest and smallest non-zero components of $1 - q^*$ is smaller than
$2^{2|P|}$ in the strongly connected case (Lemma 3.10). In the general case, two variables may not
depend on each other, even indirectly. Nevertheless, we can establish a good upper bound on
coordinates of $q^* < 1$. As before, we start with the strongly connected case:

Theorem A.3. Given a strongly connected PPS, $x = P(x)$, with $P(1) = 1$, with LFP $q^*$, such
that $0 < q^* < 1$, and with rational coefficients, then

\[ q^*_i < 1 - 2^{-3|P|} \]

for some $1 \leq i \leq n$.

Proof. Consider the vector $(I - B(1))(1 - q^*)$. As $P(1) = 1$, by Lemma 3.3 we have
$B(\frac{1}{2}(1 + q^*))(1 - q^*) = 1 - q^*$ and so

\[ (B(1) - I)(1 - q^*) = (B(1) - B(\tfrac{1}{2}(1 + q^*)))(1 - q^*) \]

This is zero except for coordinates of Form$_*$, as rows of $B(\frac{1}{2}(1 + q^*))$ and $B(1)$ that correspond to
Form$_+$ equations are identical. If we have an expression of Form$_*$, $(P(x))_i = x_j x_k$, then

\[ ((B(1) - I)(1 - q^*))_i = ((B(1) - B(\tfrac{1}{2}(1 + q^*)))(1 - q^*))_i \]
\[ = (1/2)(1 - q^*_k)(1 - q^*_j) + (1/2)(1 - q^*_j)(1 - q^*_k) \]
\[ = (1 - q^*_k)(1 - q^*_j) \]

Consequently:

\[ \|(I - B(1))(1 - q^*)\|_\infty \leq \|(1 - q^*)\|_\infty^2 \qquad (5) \]

Now suppose that $(I - B(1))$ is non-singular. In that case, we have that:

\[ 1 - q^* = (I - B(1))^{-1}(I - B(1))(1 - q^*) \]
\[ \|1 - q^*\|_\infty \leq \|(I - B(1))^{-1}\|_\infty \|(I - B(1))(1 - q^*)\|_\infty \]
\[ \|1 - q^*\|_\infty \leq \|(I - B(1))^{-1}\|_\infty \|1 - q^*\|_\infty^2 \]
\[ \|1 - q^*\|_\infty \geq \frac{1}{\|(I - B(1))^{-1}\|_\infty} \qquad (6) \]

where $\|\cdot\|_\infty$ on matrices is the induced norm of $\|\cdot\|_\infty$ on vectors. $\|A\|_\infty$ for an $n \times m$ matrix $A$
with entries $a_{ij}$ is the maximum absolute value row sum $\max_{i=1}^{n} \sum_{j=1}^{m} |a_{ij}|$.

So an upper bound on $\|(I - B(1))^{-1}\|_\infty$ will give the lower bound on $\|1 - q^*\|_\infty$ we are looking
for.

Lemma A.4. Let $A$ be a non-singular $n \times n$ matrix with rational entries. If the product of the
denominators of all these entries is $m$, then

\[ \|A^{-1}\|_\infty \leq n m \|A\|_\infty^n \]

Proof. The $(i,j)$th entry of $A^{-1}$ satisfies:

\[ |(A^{-1})_{ij}| = \frac{|\det(M_{ij})|}{|\det(A)|} \]

where $M_{ij}$ is the $(i,j)$th minor of $A$, made by deleting row $i$ and column $j$. $\|M_{ij}\|_\infty \leq \|A\|_\infty$ as
we've removed entries from rows. We always have $|\det(M_{ij})| \leq \|M_{ij}\|_\infty^n$ (see, e.g., [20] page 351),
so:

\[ |(A^{-1})_{ij}| \leq \frac{\|A\|_\infty^n}{|\det(A)|} \qquad (7) \]

Meanwhile $\det(A)$ is a non-zero rational number (because by assumption $A$ is non-singular). If we
consider the expansion for the determinant $\det(A) = \sum_\sigma \mathrm{sgn}(\sigma) \prod_{i=1}^{n} a_{i\sigma(i)}$, then the denominator of
each term $\prod_{i=1}^{n} a_{i\sigma(i)}$ is a product of denominators of distinct entries $a_{i\sigma(i)}$ and therefore divides
$m$. Since every term can thus be rewritten with denominator $m$, the sum can also be written with
denominator $m$, and therefore $|\det(A)| \geq \frac{1}{m}$. Thus, plugging into inequality (7), we have:

\[ |(A^{-1})_{ij}| \leq m \|A\|_\infty^n \]

Taking the maximum row sum $\|A^{-1}\|_\infty$,

\[ \|A^{-1}\|_\infty \leq n m \|A\|_\infty^n \]

□



If we take $(I - B(1))$ to be the matrix $A$ of Lemma A.4, then noting that the product of all the
denominators in $(I - B(1))$ is at most $2^{|P|}$, this gives:

\[ \|(I - B(1))^{-1}\|_\infty \leq n 2^{|P|} \|(I - B(1))\|_\infty^n \]

Of course $\|(I - B(1))\|_\infty \leq 1 + \|B(1)\|_\infty \leq 3$ (note that here we are using the fact that the system
is in SNF normal form). Thus

\[ \|(I - B(1))^{-1}\|_\infty \leq 3^n n 2^{|P|} \]

Using inequality (6), and since as discussed, w.l.o.g., $|P| \geq 3n \geq n \log 3 + \log n$, this gives:

\[ \|1 - q^*\|_\infty \geq \frac{1}{n} 2^{-|P|} 3^{-n} \geq 2^{-2|P|} \]

Now consider the other case where $(I - B(1))$ is singular. We can look for a small solution $v$ to:

\[ (I - B(1))v = (I - B(1))(1 - q^*) \qquad (8) \]

Lemma A.5. Suppose we have an equation $Ax = b$, with $A$ a singular $n \times n$ matrix, $b$ a non-zero
vector, and we know that $Ax = b$ has a solution. Then it must have a solution $x = A'^{-1}b$, that is, $AA'^{-1}b = b$, where
$A'$ is a non-singular matrix generated from $A$ by replacing some rows with rows that have a single
1 entry and the rest 0.

Proof. If $A$ has rank $r < n$, then there are linearly independent vectors $a_1, a_2, \ldots, a_r$ such that
$a_1^T, a_2^T, \ldots, a_r^T$ are rows of $A$ and other rows of $A$ are linear combinations of these. Let $e_1, \ldots, e_n$
be the canonical basis of $\mathbb{R}^n$, i.e. each $e_i$ has $i$th coordinate 1 and the rest 0. By the well known fact
that the set of linearly independent subsets of a vector space form a matroid, and in particular satisfy
the exchange property of a matroid (see any good linear algebra or combinatorics text, e.g., [7],
Proposition 12.8.2), we know there is a basis for $\mathbb{R}^n$ of the form $\{a_1, a_2, \ldots, a_r, e_{i_{r+1}}, e_{i_{r+2}}, \ldots, e_{i_n}\}$
for some choice of $i_{r+1}, i_{r+2}, \ldots, i_n$. We form a matrix $A'$ with elements of this basis as rows by
starting with $A$ and keeping $r$ rows corresponding to $a_1^T, a_2^T, \ldots, a_r^T$ and replacing the others in
some order with $e_{i_{r+1}}^T, e_{i_{r+2}}^T, \ldots, e_{i_n}^T$. Specifically, there is a permutation $\sigma$ of $\{1, \ldots, n\}$ such that
if $1 \leq k \leq r$, the $\sigma(k)$'th row of $A'$ and $A$ are $a_k^T$ and if $r < k \leq n$, the $\sigma(k)$'th row of $A'$ is $e_{i_k}^T$.

$A'$ is non-singular since its rows form a basis of $\mathbb{R}^n$. It remains to show that $AA'^{-1}b = b$. Since
$Ax = b$ has a solution and the set $R$ of rows $a_1^T, \ldots, a_r^T$ spans the row space of $A$, every equation
corresponding to a row of $Ax = b$ is a linear combination of the $r$ equations corresponding to the
rows in $R$. Therefore, if $x$ is any vector that satisfies the $r$ equations corresponding to the rows in
$R$ then it satisfies all the equations of $Ax = b$. The vector $A'^{-1}b$ satisfies these $r$ equations by the
definition of $A'$. Therefore, $AA'^{-1}b = b$. □

We can replace some rows of $(I - B(1))$ to get an $A'$ using this Lemma and then use Lemma
A.4 on

\[ v' = A'^{-1}(I - B(1))(1 - q^*) \]

We still have $\|A'\|_\infty \leq 3$ and the product of all the denominators of non-zero entries is smaller than
$2^{|P|}$. As for $\|(I - B(1))^{-1}\|_\infty$ before:

\[ \|A'^{-1}\|_\infty \leq 3^n n 2^{|P|} \]

Now, using inequality (5), we have

\[ \|v'\|_\infty \leq 3^n n 2^{|P|} \|(1 - q^*)\|_\infty^2 \qquad (9) \]

Now by equation (8), we have that $(I - B(1))((1 - q^*) - v') = 0$. Thus $(1 - q^*) - v'$ is an eigenvector
of $B(1)$ with eigenvalue 1. But we know that $B(1)$ is nonnegative, irreducible, and has spectral
radius bigger than 1 (because $q^* < 1$ by assumption, see e.g., [13] proof of Theorem 8.1). Thus
Perron-Frobenius theory (e.g., see Corollary 8.1.29 in [20]) gives us that $(1 - q^*) - v'$ is not a positive
vector (because the only positive eigenvectors are associated with the top eigenvalue). Thus some
coordinate $i$ has:

\[ v'_i \geq 1 - q^*_i \]

Thus, by inequality (9), we have:

\[ 1 - q^*_i \leq 3^n n 2^{|P|} \|(1 - q^*)\|_\infty^2 \]

but the proof of Lemma 3.10 gave that:

\[ (1 - q^*_i) 2^{|P|+n} \geq \|(1 - q^*)\|_\infty \]

Combining these inequalities, we have

\[ 1 - q^*_i \leq 3^n n 2^{|P|} \|(1 - q^*)\|_\infty^2 \leq 3^n n 2^{|P|} (1 - q^*_i) 2^{|P|+n} \|(1 - q^*)\|_\infty \]

Dividing both sides by $(1 - q^*_i)$, we have that:

\[ \|(1 - q^*)\|_\infty \geq \frac{1}{6^n n 2^{2|P|}} \geq 2^{-3|P|} \]

□



Theorem A.6. Given $x = P(x)$, a general PPS in SNF normal form with rational coefficients and
with LFP, $0 < q^* < 1$, then

\[ q^*_i < 1 - 2^{-4|P|} \]

for all $1 \leq i \leq n$.
Proof.

Lemma A.7. Any variable $x_i$ either depends (directly or indirectly$^4$) on a variable in a bottom
SCC $S$ such that $P_S(1) = 1$ (meaning there is no directly "leaking" variable in that SCC), or it
depends (directly or indirectly) on some variable $x_j$ of Form$_+$ with $P(x)_j = p_{j,0} + \sum_{k=1}^{m} p_{j,k} x_k$ where
$\sum_{k=0}^{m} p_{j,k} < 1$ (thus, a leaky variable).

Proof. Suppose that in the set of variables $x_i$ depends on, $D_i$, every variable of Form$_+$, $x_j$, with
$P(x)_j = p_{j,0} + \sum_{k=1}^{m} p_{j,k} x_k$ has $\sum_{k=0}^{m} p_{j,k} = 1$. Then we can verify that $P_{D_i}(1) = 1$. $D_i$ contains
some bottom SCC $S \subseteq D_i$. For this SCC $P_S(1) = 1$. □

Suppose that $x_j$ is of Form$_+$ with $P(x)_j = p_{j,0} + \sum_{k=1}^{m} p_{j,k} x_k$ where $\sum_{k=0}^{m} p_{j,k} < 1$. Then
$q^*_j = P(q^*)_j$ has $q^*_j \leq \sum_{k=0}^{m} p_{j,k}$. $1 - \sum_{k=0}^{m} p_{j,k}$ is a rational with a denominator smaller than the
product of the denominators of all the $p_{j,k}$. We have:

\[ 1 - \sum_{k=0}^{m} p_{j,k} \geq 2^{-|P|} \]

$^4$ meaning that in the dependency graph the other variable's node can be reached from the node corresponding to
$x_i$.

Thus in such a case:

\[ q^*_j \leq 1 - 2^{-|P|} \]

Lemma A.7 says that any $x_i$ either depends on such a variable, or on a variable to which
Theorem A.3 applies. That is, $x_i$ depends on some $x_j$ with

\[ q^*_j \leq 1 - 2^{-3|P|} \]

There is some sequence $x_{l_1}, x_{l_2}, \ldots, x_{l_m}$ with $l_1 = j$, $l_m = i$ and for every $0 < k < m$, $P(x)_{l_{k+1}}$
contains a term with $x_{l_k}$. If $x_{l_{k+1}}$ has Form$_*$, then $q^*_{l_{k+1}} \leq q^*_{l_k}$. If $x_{l_{k+1}}$ has Form$_+$, then $1 - q^*_{l_{k+1}} \geq
p_{l_{k+1}, l_k}(1 - q^*_{l_k})$. By an easy induction:

\[ 1 - q^*_i \geq \Big(\prod_{k \,:\, x_{l_{k+1}} \text{ has Form}_+} p_{l_{k+1}, l_k}\Big)(1 - q^*_j) \]

Again, $|P|$ is at least the number of bits describing these rationals $p_{l_{k+1}, l_k}$, and thus

\[ 1 - q^*_i \geq 2^{-|P|}(1 - q^*_j) \]

Since we already know that $q^*_j \leq 1 - 2^{-3|P|}$, i.e., that $(1 - q^*_j) \geq 2^{-3|P|}$, we obtain:

\[ 1 - q^*_i \geq 2^{-|P|} 2^{-3|P|} = 2^{-4|P|} \]

This completes the proof of the theorem. □



A.7 Proof of Lemma 4.3

Lemma 4.3. If we run the rounded down Newton method starting with $x^{[0]} := 0$ on a PPS,
$x = P(x)$, with LFP $q^*$, $0 < q^* < 1$, then for all $k \geq 0$, $x^{[k]}$ is well-defined and $0 \leq x^{[k]} \leq q^*$.

Proof. We prove this by induction on $k$. The base case $x^{[0]} = 0$ is immediate. Suppose the claim
holds for $k$ and thus $0 \leq x^{[k]} \leq q^*$. Lemma 3.4 tells us that

\[ q^* - x^{\{k+1\}} = (I - B(x^{[k]}))^{-1}\,\frac{B(q^*) - B(x^{[k]})}{2}\,(q^* - x^{[k]}) \]

Now the fact that $0 \leq x^{[k]} \leq q^*$ yields that each of the following inequalities hold: $(q^* - x^{[k]}) \geq 0$,
$B(q^*) - B(x^{[k]}) \geq 0$. Furthermore, by Theorem 3.6 we have that $\rho(B(x^{[k]})) < 1$, and thus that
$(I - B(x^{[k]}))$ is non-singular and $(I - B(x^{[k]}))^{-1} \geq 0$. We thus conclude that $q^* - x^{\{k+1\}} \geq 0$, i.e.,
that $x^{\{k+1\}} \leq q^*$. The rounding down ensures that $0 \leq x_i^{[k+1]} \leq x_i^{\{k+1\}}$ unless $x_i^{\{k+1\}} < 0$, in which
case $x_i^{[k+1]} = 0$. In both cases, we have that $0 \leq x^{[k+1]} \leq q^*$. So we are done by induction. □

A.8 Proof of Lemma 4.4

Lemma 4.4. For a PPS, $x = P(x)$, with LFP $q^*$, such that $0 < q^* < 1$, if we apply the rounded
down Newton's method with parameter $h$, starting at $x^{[0]} := 0$, then for all $j' \geq 0$, we have:

\[ \|q^* - x^{[j'+1]}\|_\infty \leq 2^{-j'} + 2^{-h+1+4|P|} \]

Proof. Since $x^{[0]} := 0$:

\[ q^* - x^{[0]} = q^* \leq 1 \leq \frac{1}{(1-q^*)_{\min}}(1 - q^*) \qquad (10) \]

For any $k \geq 0$, if $q^* - x^{[k]} \leq \lambda(1 - q^*)$, then by Lemma 3.7 we have:

\[ q^* - x^{\{k+1\}} \leq \left(\frac{\lambda}{2}\right)(1 - q^*) \qquad (11) \]

Observe that after every iteration $k > 0$, in every coordinate $i$ we have:

\[ x_i^{[k]} \geq x_i^{\{k\}} - 2^{-h} \qquad (12) \]

This holds simply because we are rounding down $x_i^{\{k\}}$ by at most $2^{-h}$, unless it is negative in which
case $x_i^{[k]} = 0 > x_i^{\{k\}}$. Combining the two inequalities (11) and (12) yields the following inequality:

\[ q^* - x^{[k+1]} \leq \left(\frac{\lambda}{2}\right)(1 - q^*) + 2^{-h}\mathbf{1} \leq \left(\frac{\lambda}{2} + \frac{2^{-h}}{(1-q^*)_{\min}}\right)(1 - q^*) \]

Taking inequality (10) as the base case (with $\lambda = \frac{1}{(1-q^*)_{\min}}$), by induction on $k$, for all $k \geq 0$:

\[ q^* - x^{[k+1]} \leq \Big(2^{-k} + \sum_{i=0}^{k} 2^{-(h+i)}\Big)\frac{1}{(1-q^*)_{\min}}(1 - q^*) \]

But $\sum_{i=0}^{k} 2^{-(h+i)} \leq 2^{-h+1}$ and $\frac{\|1-q^*\|_\infty}{(1-q^*)_{\min}} \leq \frac{1}{(1-q^*)_{\min}} \leq 2^{4|P|}$, by Theorem 3.12. Thus:

\[ q^* - x^{[k+1]} \leq (2^{-k} + 2^{-h+1})\,2^{4|P|}\,\mathbf{1} \]

Clearly, we have $q^* - x^{[k]} \geq 0$ for all $k$. Thus we have shown that for all $k \geq 0$:

\[ \|q^* - x^{[k+1]}\|_\infty \leq (2^{-k} + 2^{-h+1})\,2^{4|P|} = 2^{-k} + 2^{-h+1+4|P|}. \]

□


A.9 Proof of Corollary 4.5

Corollary 4.5. Given any PPS, $x = P(x)$, with LFP $q^*$, we can approximate $q^*$ within additive
error $2^{-j}$ in time polynomial in $|P|$ and $j$ (in the standard Turing model of computation). More
precisely, we can compute a vector $v \le q^*$ such that $v \in [0,1]^n$ and $\|q^* - v\|_\infty \le 2^{-j}$.

Proof. Firstly, by Propositions 2.1 and 2.2 we can assume $x = P(x)$ is in SNF form, and that
$0 < q^* < 1$. By Theorem 4.2, the rounded down Newton's method with parameter $h = j + 2 + 4|P|$,
run for $h = j + 2 + 4|P|$ iterations, computes a rational vector $v = x^{[h]}$ such that $v \in [0,1]^n$, and
$\|q^* - v\|_\infty \le 2^{-j}$.

Furthermore, for all $k$, with $0 \le k \le h$, $x^{[k]}$ has encoding size polynomial in $|P|$ and $j$. We then
simply need to note that all the linear algebra operations required in a single iteration of Newton's
method, that is, matrix multiplication, addition, and matrix inversion, can be performed exactly
on rational inputs in polynomial time and yield rational results of polynomial size. □
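To make the rounded down Newton iteration concrete, the following is a minimal sketch for a single-variable PPS. The example equation $x = x^2/2 + 1/4$, the function name `rounded_down_newton`, and the choice $h = 40$ are assumptions made only for illustration; in particular, $h$ is not derived here from $j + 2 + 4|P|$. In one variable the matrix $B(x)$ is just the derivative $P'(x) = x$, so no matrix inversion is needed.

```python
from fractions import Fraction
import math


def rounded_down_newton(h: int, iterations: int) -> Fraction:
    """Rounded down Newton iteration for the one-variable PPS x = x^2/2 + 1/4."""
    half, quarter = Fraction(1, 2), Fraction(1, 4)
    scale = 2 ** h
    x = Fraction(0)  # x^[0] := 0
    for _ in range(iterations):
        p_x = half * x * x + quarter        # P(x)
        b_x = x                             # B(x) = P'(x), a 1x1 Jacobian here
        newton = x + (p_x - x) / (1 - b_x)  # unrounded Newton step
        # Round down to a multiple of 2^-h, clamping negative values to 0.
        x = Fraction(math.floor(max(newton, Fraction(0)) * scale), scale)
    return x


h = 40
v = rounded_down_newton(h, iterations=h)
print(float(v))
```

All arithmetic is exact rational arithmetic, and every iterate is a multiple of $2^{-h}$, which is what keeps the encoding size of the iterates bounded.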
B Appendix B: Application to parsing for general SCFGs

In this section, we use our P-time algorithm for approximating the extinction probabilities $q^*$
of BPs, equivalently the termination probabilities of a stochastic context-free grammar (SCFG),
a.k.a. the partition function of an SCFG, in order to provide the first P-time algorithm for solving
an approximate version of a key probabilistic parsing problem, with respect to any given SCFG,
including grammars that contain $\epsilon$-rules, i.e., rules of the form $A \xrightarrow{p} \epsilon$, where $\epsilon$ denotes the empty
string and $A$ is an arbitrary nonterminal of the grammar.

String Probability Problem: Given a SCFG $G$, and a finite string $w \in \Sigma^*$ over the terminal
alphabet $\Sigma$ of $G$, compute the probability, $p_{G,w}$, that the stochastic grammar $G$ generates the finite
string $w$.

For SCFGs that are already in Chomsky Normal Form (CNF) there is a well known dynamic 
programming algorithm for this problem (see, e.g., [25] ). This is based on a direct extension to the 
probabilistic setting of the classic Cocke-Kasami-Younger (CKY) dynamic programming algorithm
for parsing, i.e., determining whether an ordinary context-free grammar in CNF form can generate 
a given string w (see, e.g., [19]). 

As is well known, any ordinary (non-stochastic) context-free grammar can be converted to one 
in CNF form that generates exactly the same set of strings. 

However, the situation for SCFGs is more subtle. It is known that for every SCFG, $G$, there
exists another SCFG, $G'$, that is in CNF form, which has the same probability of generating any
finite string. This was shown by Abney, McAllester, and Pereira in [1]. As mentioned in [1], their
proof of this is nonconstructive and yielded no algorithm to obtain $G'$ from $G$. Moreover, as we
shall see, when $G$ contains only rules with rational probabilities, it can nevertheless be the case
that there does not exist any $G'$ in CNF which also has only rational rule probabilities and which
generates every string with the same probability as $G$. In other words, the claim that every SCFG
$G$ with rational rule probabilities is "equivalent" to an SCFG $G'$ in CNF form only holds in
general if $G'$ is allowed to have irrational rule probabilities (even though $G$ does not).

We shall nevertheless show that both these issues can be overcome: (1) the nonconstructive 
nature of the prior existence arguments, and (2) the fact that CNF form SCFGs must in general have 
irrational rule probabilities. In fact, we shall give a constructive transformation from any SCFG to 
one in CNF form, and we shall show that given any SCFG G with rational rule probabilities, it is 
possible to compute in P-time a new CNF SCFG, G', which also has only rational rule probabilities, 
and which suitably approximates G. Our proof of this shall make crucial use of our P-time algorithm 
for approximating the LFP q* of a PPS, i.e., approximating termination probabilities for arbitrary 
SCFGs. 

When an SCFG $G$ is already in CNF form, assuming its rule probabilities are rational, probabilistic
versions of the CKY algorithm can be used to compute the exact and rational probabilities
$p_{G,w}$. These algorithms use only polynomially many $\{+, *\}$-arithmetic operations over the rule
probabilities of the SCFG $G$, and thus they run in P-time in the unit-cost arithmetic RAM model
of computation. Variants of these algorithms can be made to work more generally, when the SCFG
is not in CNF form but contains no $\epsilon$-rules. However, for general SCFGs that do contain arbitrary
$\epsilon$-rules, these algorithms do not work.
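As a point of reference, the probabilistic CKY ("inside") computation for a CNF grammar without $\epsilon$-rules can be sketched as follows. The rule encoding (tuples of nonterminal names and probabilities) and the function name `inside_probability` are assumptions chosen for this sketch, not notation from the cited literature.

```python
from collections import defaultdict
from fractions import Fraction


def inside_probability(binary_rules, lexical_rules, start, word):
    """Total probability that a CNF SCFG (without epsilon-rules) generates `word`.

    binary_rules:  list of (A, B, C, p) for rules A -> B C with probability p
    lexical_rules: list of (A, a, p)    for rules A -> a   with probability p
    """
    n = len(word)
    if n == 0:
        return Fraction(0)  # this sketch has no epsilon-rules
    # inside[(i, j)][A] = probability that A generates word[i:j]
    inside = defaultdict(lambda: defaultdict(Fraction))
    for i, symbol in enumerate(word):
        for a_nt, terminal, p in lexical_rules:
            if terminal == symbol:
                inside[(i, i + 1)][a_nt] += p
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                left, right = inside[(i, k)], inside[(k, j)]
                for a_nt, b_nt, c_nt, p in binary_rules:
                    if b_nt in left and c_nt in right:
                        inside[(i, j)][a_nt] += p * left[b_nt] * right[c_nt]
    return inside[(0, n)][start]


binary = [("S", "S", "S", Fraction(2, 3))]
lexical = [("S", "a", Fraction(1, 3))]
print(inside_probability(binary, lexical, "S", "aaa"))
```

Only additions and multiplications over the rule probabilities are used, so with rational inputs the result is an exact rational number.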

Let us see why, unlike ordinary CFGs, it is not possible to "convert" an arbitrary SCFG with
$\epsilon$-rules and rational rule probabilities to an "exactly equivalent" SCFG in CNF form, without
irrational rule probabilities. This follows easily from the fact that the total parse probabilities $p_{G,w}$
can in general be irrational, even when all rule probabilities of $G$ are rational, and therefore we
cannot possibly compute $p_{G,w}$ exactly using only finitely many $\{+, *\}$-arithmetic operations over
the rational rule probabilities. To see that $p_{G,w}$ can be irrational, consider the following simple
grammar: $S \xrightarrow{1} aA$, $A \xrightarrow{1/2} AA$, $A \xrightarrow{1/4} \epsilon$, $A \xrightarrow{1/4} NN$, and $N \xrightarrow{1} NN$. It is easy to see that the
probability that this grammar generates the string $a$ is precisely the probability of termination
starting at nonterminal $A$, which is the least non-negative solution of $x = (1/2)x^2 + 1/4$, which is
$1 - 1/\sqrt{2}$.
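For completeness, the value of this least solution can be checked directly with the quadratic formula:

```latex
\begin{align*}
x = \tfrac{1}{2}x^2 + \tfrac{1}{4}
  &\iff 2x^2 - 4x + 1 = 0 \\
  &\iff x = \frac{4 \pm \sqrt{16 - 8}}{4} = 1 \pm \frac{1}{\sqrt{2}}.
\end{align*}
```

The smaller root, $1 - 1/\sqrt{2} \approx 0.2929$, is the least non-negative solution, and it is irrational because $\sqrt{2}$ is.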

As we saw, the total probability of $G$ generating $w$, $p_{G,w}$, may be irrational for SCFGs $G$
with $\epsilon$-rules. In typical applications one will not need to compute $p_{G,w}$ "exactly". It will suffice to
approximate it "well", or to answer questions about it such as whether $p_{G,w} \ge r$ for a given rational
probability $r \in [0,1]$. It is easy to see that these questions are at least as hard as the corresponding
questions for the termination probabilities of a SCFG.

Proposition B.1. Given a SCFG $G$ (with rational probabilities) we can construct in polynomial
time another SCFG $H$ such that the termination probability $q_G$ of $G$ is equal to the probability $p_{H,\epsilon}$
that $H$ generates the empty string $\epsilon$ (or any other string $w$). Hence, the problem of deciding if
$p_{H,\epsilon} \ge 1/2$ for a given SCFG $H$ is SQRT-SUM-hard and PosSLP-hard.

Proof. Given a SCFG $G$, remove all the terminal symbols from the right hand sides of all the rules,
and let $H$ be the resulting SCFG. Clearly, there is a 1-to-1 correspondence between the derivations
of $G$ and $H$. A derivation of $G$ derives a terminal string iff the corresponding derivation of $H$
derives the empty string $\epsilon$. Therefore, $q_G = p_{H,\epsilon}$. The SQRT-SUM-hardness and PosSLP-hardness
of the problem of deciding whether $p_{H,\epsilon} \ge 1/2$ follows from the hardness of the analogous question
"$q_G \ge 1/2$?" for the termination probability ([13]).

We can modify the reduction, if desired, to show the same result for any string $w$ instead of $\epsilon$:
just add to $H$ a new start nonterminal $S'$, with the rule $S' \xrightarrow{1} Sw$. □
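The reduction in this proof is simple enough to sketch directly. The rule encoding (triples of left hand side, right hand side tuple, and probability) and the helper names `strip_terminals` and `add_suffix_start` are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def strip_terminals(rules, nonterminals):
    """Build H from G by deleting every terminal symbol from each right-hand side.

    rules: list of (lhs, rhs, p), where rhs is a tuple of symbols and () is epsilon.
    """
    return [(lhs, tuple(s for s in rhs if s in nonterminals), p)
            for lhs, rhs, p in rules]


def add_suffix_start(rules, start, word, new_start="S'"):
    """Variant for an arbitrary string w: add the rule S' -> S w with probability 1."""
    return rules + [(new_start, (start,) + tuple(word), Fraction(1))]


g = [("S", ("a", "S", "b"), Fraction(1, 2)), ("S", (), Fraction(1, 2))]
h = strip_terminals(g, {"S"})
print(h)
print(add_suffix_start(h, "S", "ab"))
```

Each rule of $G$ yields exactly one rule of $H$ with the same probability, which is the 1-to-1 correspondence between derivations used in the proof.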

Since deciding whether $p_{G,w} \ge p$ is hard, we focus on approximating the probabilities $p_{G,w}$. The
proposition implies that the problem of approximating the probability $p_{G,w}$ for a given SCFG $G$
and string $w$ is at least as hard as the problem of approximating the termination probability of a
given SCFG. We will show in this section the converse: we will be able to use our
P-time approximation algorithm for SCFG termination probabilities, combined with other ideas,
to approximate $p_{G,w}$ in P-time. It is important to note that all of our P-time algorithms are in the
standard Turing model of computation, not in the more powerful unit-cost arithmetic RAM model
of computation.

Computation of the total probabilities $p_{G,w}$ forms a key ingredient in the well known inside-
outside algorithm [U [24] for learning the maximum likelihood parameters of an SCFG by using 
example parses as training instances. Namely, the inside subroutine of the inside-outside algorithm
is precisely a subroutine for computing $p_{G,w}$. However, as mentioned, the well known CKY-based
dynamic programming algorithm for computing $p_{G,w}$ applies only to stochastic grammars that are
already in CNF form, and in particular to grammars that have no $\epsilon$-rules, or else have only one
$\epsilon$-rule associated with the start nonterminal $S$, where $S$ cannot appear on the right hand side (RHS)
of any other rule.

The inside-outside algorithm can be viewed as an extension of the Baum-Welch forward-backward
algorithm for finite-state hidden Markov models (or, more generally, a version of the
EM algorithm for finding maximum likelihood parameters for statistical models), to the setting of
SCFGs with hidden parameters. It is used heavily both in statistical NLP ([25]) and in biological
sequence analysis (see [8]) for learning the parameters of a stochastic grammar based on parsed
training instances. In the case of biological sequence analysis, SCFGs are used for predicting the
secondary structure (i.e., two dimensional folding structure) of RNA molecules based on their nucleic
acid sequence, where a given parse of a sequence corresponds to a particular folding pattern.
The parameters of the SCFG are learned using given folded RNA strands as training instances (see,
e.g., [29, 8]). Some standard and natural grammars used for analysing RNA secondary structure
do contain $\epsilon$-rules, and are not in CNF form (see, e.g., the RNA grammar in [8], on page 273).
For these grammars, typically what researchers do is to devise tailor-made algorithms specific to
the grammar for computing probabilities like $p_{G,w}$. It is clearly desirable to instead have a version
of the inside-outside algorithm, as well as versions of other algorithms related to probabilistic
parsing, that are applicable to arbitrary SCFGs, including those with $\epsilon$-rules, since in many
applications the most natural grammar may not be in CNF form and may require $\epsilon$-rules.

Note that, as mentioned, in general, it is not true that any SCFG G with rational rule proba- 
bilities can be converted to a CNF form SCFG G' with rational rule probabilities which generates 
exactly the same probability distribution on strings. To obtain exactly the same probability distri- 
bution on strings, the CNF grammar G' may need to contain rules with irrational probabilities. 

We now begin a detailed formal treatment. Recall that an SCFG $G = (V, \Sigma, R, S)$ consists of
a finite set $V$ of nonterminals, a start nonterminal $S \in V$, a finite set $\Sigma$ of alphabet symbols, and
a finite list of rules, $R = \langle r_1, \ldots, r_k \rangle$, where each rule $r_i$ is specified by a triple $(A, p, \gamma)$ which we
shall denote by $A \xrightarrow{p} \gamma$, where $A \in V$ is a nonterminal, $p \in [0,1]$ is the probability associated with
this rule, and $\gamma \in (V \cup \Sigma)^*$ is a possibly empty string of terminals and nonterminals. A rule of the
form $A \xrightarrow{p} \epsilon$, where $\epsilon$ is the empty string, is called an $\epsilon$-rule.

For technical convenience, we allow for the possibility that two distinct rules $r_i$ and $r_j$, $i \ne j$,
may nevertheless correspond to exactly the same triple $(A, p, \gamma)$, and that rules may have
both identical left hand side (LHS) and right hand side (RHS). For these reasons, we distinguish
different rules $r_i$ and $r_j$ by their indices $i$ and $j$, and that is why $R$ is not viewed as a set, but as a
list of rules.

For a rule $r \in R$ whose corresponding triple is $(A, p, \gamma)$, we define $LHS(r) := A$, $p(r) := p$,
and $RHS(r) := \gamma$. For a rule $r$, where $LHS(r) = A$, we say that rule $r$ is associated with
the nonterminal $A$. Let $R_A$ denote the set of rules associated with nonterminal $A$. We call $G$ a
stochastic (or probabilistic) context-free grammar (SCFG or PCFG), if for every nonterminal $A \in V$,
we have $p_A \le 1$, where $p_A$ is defined as the sum of the probabilities of rules in $R_A$, i.e.:

$$p_A := \sum_{r \in R_A} p(r)$$

An SCFG is called proper if $p_A = 1$ for all nonterminals $A$. It is however easy to see that
requiring properness for SCFGs is without loss of generality, even when the grammar needs to be
in a special normal form, such as CNF, because we can always make the stochastic grammar proper
by adding an extra rule $A \xrightarrow{1 - p_A} NN$ which carries the residual probability $(1 - p_A)$, where $N$ is a
new nonterminal and where there is a new rule $N \xrightarrow{1} NN$. This yields a new proper SCFG which
has exactly the same probability of generating any particular finite string of terminals as did the
old SCFG, and has the same finite parse trees with the same probability. We can therefore assume,
w.l.o.g., that all the input SCFGs we consider in this paper are proper.
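The construction that makes a grammar proper can be sketched in a few lines. The rule encoding and the name `make_proper` are assumptions for illustration, and the caller is assumed to pass a name for the "dead" nonterminal $N$ that does not already occur in the grammar.

```python
from fractions import Fraction


def make_proper(nonterminals, rules, dead="N"):
    """Add A -> N N with residual probability 1 - p_A, plus N -> N N with probability 1.

    rules: list of (lhs, rhs, p); `dead` is a hypothetical fresh nonterminal name.
    """
    assert dead not in nonterminals
    total = {a: Fraction(0) for a in nonterminals}
    for lhs, _, p in rules:
        total[lhs] += p  # p_A: sum of the probabilities of the rules in R_A
    extra = [(a, (dead, dead), 1 - p_a) for a, p_a in total.items() if p_a < 1]
    if not extra:
        return list(rules)  # already proper, nothing to add
    return list(rules) + extra + [(dead, (dead, dead), Fraction(1))]


rules = [("S", ("a",), Fraction(1, 3)), ("S", ("S", "S"), Fraction(1, 3))]
print(make_proper({"S"}, rules))
```

Since $N$ never terminates, the added rules contribute nothing to the probability of any finite string.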

An SCFG is in Chomsky Normal Form (CNF) if it satisfies the following conditions:

• The grammar does not contain any $\epsilon$-rule except possibly for a rule $S \xrightarrow{p} \epsilon$ associated with
the start nonterminal $S$; if it does contain such a rule then $S$ does not appear on the right
hand side of any rule in the grammar.

• Every rule, other than $S \xrightarrow{p} \epsilon$, is either of the form $A \xrightarrow{p} BC$, or of the form $A \xrightarrow{p} a$, where $A$,
$B$, and $C$ are nonterminals in $V$ and $a \in \Sigma$ is a terminal symbol.

5 In some of our algorithms, while processing input SCFGs, they may become improper, in which case we can
clearly then convert them back again to proper SCFGs, by the same method. It is worth noting that in some
definitions of PCFGs, notably in [27], the authors even permit the sum of the probabilities of rules associated with
a given nonterminal $A$ to be $p_A > 1$. Specifically, they define PCFGs to be any weighted context-free grammar where
all rules have a weight $p \in [0,1]$, without the condition that the weights associated with a given nonterminal $A$ must
sum to $\le 1$, but with the added condition that the total weight of generating any finite string must be in $[0,1]$. The
"weight" of a given string generated by a weighted grammar is defined and computed analogously to the way we
compute the total probability of a string being generated by an SCFG. We shall not elaborate on the definition here.
As we shall discuss, this definition of PCFGs as a more general subclass of weighted context-free grammars is in fact
too general in several important ways. In particular, we showed in [13] that such weighted grammars subsume the
general RMC model, for which we proved in [13] that computing or even approximating termination probabilities to
within any nontrivial approximation threshold is already at least as hard as some long standing open problems in
numerical computation, namely SQRT-SUM and PosSLP, which are problems not even known to be in NP. Thus it is
unlikely that one could devise a P-time algorithm for approximating the "termination probability" for the generalized
definition of PCFGs based on weighted grammars that is given by [27]. However, the important point is that, as we
show, we don't need to solve this more general problem in order to approximate $p_{G,w}$ for standard SCFGs. We restrict
ourselves in this paper to the more standard definition of SCFGs (or PCFGs). Namely, we assume that the probabilities
of rules associated with each nonterminal must sum to $\le 1$, and in fact w.l.o.g., that they sum to exactly 1.

We define a finite parse tree, $t$, for a string $w \in \Sigma^*$ in a SCFG, $G$, starting at (or rooted at)
nonterminal $A$, to be a rooted, labeled, ordered, finite tree, such that all leaf nodes of $t$ are labeled
by a terminal symbol in $\Sigma$ or by $\epsilon$, and such that all internal (i.e., non-leaf) nodes are labeled by
a pair $(B, r)$, where $B \in V$ is some nonterminal of the grammar, and $r \in R_B$ is some rule of the
SCFG associated with $B$. For an internal node $z$ that has label $(B, r)$, we define $L_1(z) := B$, and
$L_2(z) := r$ to describe its two labels.

If an internal node $z$ has $L_2(z) = r$, and $RHS(r) = \gamma$, then node $z$ must have exactly $|\gamma|$
children, unless $\gamma = \epsilon$. The children of $z$ are then labeled, from left to right, by the sequence of
symbols in $\gamma$. (If $\gamma = \epsilon$, then the single child is a leaf labeled by $\epsilon$.) Finally, for $t$ to be a finite
parse tree for the string $w \in \Sigma^*$, it must be the case that the labels on the leaves of the tree, when
concatenated together from left to right, form precisely the string $w$. Note that the empty string,
$\epsilon$, is an identity element for the concatenation operator.

We consider two parse trees to be identical if they are isomorphic as rooted labeled ordered
trees, where the label of an internal node $z$ includes both the associated nonterminal $L_1(z)$ and the
associated rule $L_2(z)$. We use $T^A_{G,w}$ to denote the set of distinct parse trees rooted at nonterminal
$A$ for the string $w \in \Sigma^*$ in the SCFG $G$.

We now describe a probabilistic derivation for a SCFG, $G$, starting at a nonterminal $A$, as a
stochastic process that proceeds to generate a random derivation tree, which may either be infinite,
or may terminate at a finite parse tree. The derivation starts as a tree $T_0$ with a single root node
$root$ which the process will randomly "grow" into a tree as follows. The root is labeled by the
start nonterminal $A$, so $L_1(root) := A$. At each step of the derivation, for every current leaf node,
$z$, of the current derivation tree, $T_i$, such that the leaf node $z$ has $L_1(z) = A$, we "expand" that
nonterminal occurrence by randomly choosing a rule $r \in R_A$, letting $L_2(z) := r$, where the rule $r$ is
chosen independently at random for each such leaf node, according to the probabilities $p(r)$ of the
rules $r \in R_A$.$^6$ We then use the chosen rule $r$ to add $|RHS(r)|$ new children for that leaf node,
where these children are labeled, from left to right, by the sequence of terminal and nonterminal
symbols in $\gamma = RHS(r)$. If $\gamma = \epsilon$, then there is only one child added, labeled by $\epsilon$. We continue to
repeat this "expansion" process until the derivation yields a finite parse tree having only terminal
symbols (including possibly $\epsilon$) labeling all of its leaves, in which case the process stops. Otherwise,
i.e., if the process never encounters a finite parse tree having only terminal symbols labeling the
leaves, then the derivation never stops and goes on forever, generating an infinite sequence of
larger and larger derivation trees. If the derivation stops and generates a finite parse tree, $t$, then if
the concatenation of the sequence of symbols on the leaves of that parse tree $t$ is a string $w \in \Sigma^*$, we
say that the derivation process on the SCFG $G$, starting at nonterminal $A$, has generated the string
$w$. We use $P_{G,A}(t)$ to denote the probability that the finite parse tree $t$ rooted at $A$ is generated
by grammar $G$ starting at nonterminal $A$. It is clear that $P_{G,A}(t)$ is the product over all internal
nodes of $t$ of the probability of the rule associated with that internal node. In other words:

$$P_{G,A}(t) = \prod_{\{z \mid z \text{ is an internal node of } t\}} p(L_2(z)) \tag{13}$$

6 Note that if the SCFG is not proper, then as described this is not a well-defined stochastic process. We rectify
this by simply asserting that if the SCFG is not proper, then with the residual probability $(1 - p_A)$ at every leaf
labeled by $A$ we generate two new children labeled by a new special nonterminal $N$ which will generate an infinite
tree with probability 1, via a rule $N \xrightarrow{1} NN$. This corresponds to the way we converted any (CNF-form) SCFG into
a proper (CNF-form) SCFG.

We denote by $p^A_{G,w}$ the probability that starting at nonterminal $A$ of the grammar $G$ the
derivation process generates the string $w$. Clearly we have:

$$p^A_{G,w} = \sum_{t \in T^A_{G,w}} P_{G,A}(t) \tag{14}$$

We now extend the definition of the "derivation" process, so that it can start not just at a nonterminal,
but at a string of terminals and nonterminals, as follows.

For any string $\gamma \in (V \cup \Sigma)^*$ of terminals and nonterminals of $G$, if $\gamma = \epsilon$, the derivation process
simply begins and ends with a tree consisting of one node labeled by $\epsilon$. Otherwise, if $\gamma = \gamma_1 \ldots \gamma_m$,
where $\gamma_i \in (V \cup \Sigma)$ for $i = 1, \ldots, m$, the derivation process consists of a sequence of derivation
processes, starting at each symbol $\gamma_i$, for $i$ going from 1 to $m$. If $\gamma_i$ is a nonterminal $A$, then
the derivation process is the same as that starting at $A$. If $\gamma_i$ is a terminal symbol, then the
derivation process starting at $\gamma_i$ simply begins and ends with a tree consisting of one node labeled
by the terminal symbol $\gamma_i$. If the entire sequence of derivation processes terminates and generates
finite parse trees, then if the sequential concatenation of the strings generated by each of this
sequence of parse trees yields a string $w \in \Sigma^*$ we say that this derivation process starting at $\gamma$
generated the string $w$. Let $p^{\gamma}_{G,w}$ denote the probability that the derivation of the SCFG $G$, starting
with the string $\gamma$, generates the terminal string $w$.

The termination probability of an SCFG, $G$, starting at nonterminal $A$, denoted $q^A_G$, is the
probability with which the derivation process starting at $A$ eventually stops and generates a finite
string, and a finite parse tree. It is clearly given by:

$$q^A_G = \sum_{w \in \Sigma^*} p^A_{G,w}$$

An SCFG $G$ is called consistent if $q^S_G = 1$, where $S$ is the start nonterminal of $G$. Note that even
if the given SCFG $G$ is proper (meaning the probabilities of rules associated with every nonterminal
sum to 1), this does not necessarily imply that $G$ is consistent. Indeed, we know that proper SCFGs
need not terminate with probability 1. For example, the SCFG given by $S \xrightarrow{2/3} SS$, $S \xrightarrow{1/3} a$, is
proper but only terminates with probability 1/2.
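To verify the value 1/2 in this example, note that the termination probability $x$ starting at $S$ is the least non-negative solution of the fixed point equation induced by the two rules:

```latex
\begin{align*}
x = \tfrac{2}{3}x^2 + \tfrac{1}{3}
  &\iff 2x^2 - 3x + 1 = 0 \\
  &\iff (2x - 1)(x - 1) = 0 \\
  &\iff x \in \{\tfrac{1}{2},\, 1\}.
\end{align*}
```

The least solution is $x = 1/2$, so the grammar is proper but not consistent.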

For the encoding of input SCFGs, for purposes of analyzing the complexity of algorithms, 
we assume that the probabilities associated with each rule of the input SCFG are rational values 
encoded in the usual way, by giving their numerator and denominator in binary. We shall use |G| to
denote the encoding size of an input SCFG G, i.e., the number of bits required to represent G with 
this binary encoding for the rational rule probabilities. In our formal analysis, when reasoning about 
our algorithms, we will in fact need to consider SCFGs whose rule probabilities can be irrational, 
but we shall not need to actually compute these probabilities exactly, only approximately. 

The general statement of the approximate total probability parsing problem is as follows. We
are given as input an SCFG, $G$, which w.l.o.g. we assume to be proper, and a finite
word $w = w_1 \ldots w_n \in \Sigma^*$ over the terminal alphabet $\Sigma$ of the SCFG $G$. Finally, we are also given
a rational error threshold $\delta > 0$. As output, we wish to approximate within error $\delta$ the probability
that a probabilistic derivation of $G$ generates the string $w$, which we denote by $p_{G,w} := p^S_{G,w}$, where
$S$ is the start nonterminal of $G$. In other words, we want our algorithm to output a rational value
$v \in [0,1]$ such that $|v - p_{G,w}| \le \delta$. Importantly, we allow the grammar $G$ to have $\epsilon$-rules of the form
$A \xrightarrow{p} \epsilon$. As we have discussed, allowing such rules makes this problem substantially more difficult.

Our first main aim is to prove the following theorem: 

Theorem B.2. (Approximation of the total parse probability of a string on an SCFG) There is a
polynomial-time algorithm for approximating the total probability that a given string $w$ is generated
by a given arbitrary SCFG, $G$, including an SCFG that contains arbitrary $\epsilon$-rules.

More precisely, there is a polynomial-time algorithm that, given as input any (proper) SCFG, $G$,
with rational rule probabilities and with a terminal alphabet $\Sigma$, given any string $w \in \Sigma^*$, and given
any rational value $\delta > 0$ in standard binary representation, computes a rational value $v \in [0,1]$
such that $|v - p_{G,w}| \le \delta$.

Crucial for establishing these results is the following normalization theorem, which is of more 
general applicability. It says that any SCFG can be converted in P-time to a suitable "approximate" 
SCFG which is in CNF form. Let us give a precise definition of a notion of approximate SCFG. 

Definition B.3. For any SCFG $G = (V, \Sigma, R, S)$, and any $\delta > 0$, we define a set of SCFGs,
denoted $B_\delta(G)$, called the $\delta$-ball around $G$, as follows. $B_\delta(G)$ consists of all SCFGs,
$G' = (V, \Sigma, R', S)$, such that $G'$ has exactly the same nonterminals $V$, terminal alphabet $\Sigma$, and start
nonterminal $S$ as $G$, and furthermore such that the rules in $R'$ of $G'$ that have non-zero probability are
exactly the same as the rules $R$ of $G$ that have non-zero probability, and furthermore for every rule
$r \in R$ of $G$, and the corresponding rule $r' \in R'$ of $G'$, we have $|p(r) - p(r')| \le \delta$.
For any $G' \in B_\delta(G)$, we say that $G'$ $\delta$-approximates $G$.
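Membership in the $\delta$-ball is straightforward to check when both grammars have rational rule probabilities. The sketch below assumes, purely for illustration, that both grammars are given as rule lists in which the $i$-th rule of $G'$ corresponds to the $i$-th rule of $G$; the name `in_delta_ball` is hypothetical.

```python
from fractions import Fraction


def in_delta_ball(rules_g, rules_g_prime, delta):
    """Check G' in B_delta(G) for two rule lists over the same V, Sigma and S.

    Each list holds (lhs, rhs, p) triples; rule i of G' corresponds to rule i of G.
    """
    if len(rules_g) != len(rules_g_prime):
        return False
    for (lhs, rhs, p), (lhs2, rhs2, p2) in zip(rules_g, rules_g_prime):
        if (lhs, rhs) != (lhs2, rhs2):
            return False
        if (p > 0) != (p2 > 0):   # same set of non-zero-probability rules
            return False
        if abs(p - p2) > delta:   # probabilities differ by at most delta
            return False
    return True


g = [("S", ("S", "S"), Fraction(2, 3)), ("S", ("a",), Fraction(1, 3))]
g2 = [("S", ("S", "S"), Fraction(66, 100)), ("S", ("a",), Fraction(34, 100))]
print(in_delta_ball(g, g2, Fraction(1, 100)))
```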

Theorem B.4. (Approximation of an SCFG by an SCFG in CNF form) There is a polynomial-time
algorithm that, given as input any (proper) SCFG, $G$, with rational rule probabilities, given
any natural number $N$ represented in unary, and given any rational value $\delta > 0$ in standard binary
representation, computes a new SCFG, $G'$, such that $G'$ is in Chomsky Normal Form, and has
rational rule probabilities, and such that $G' \in B_\delta(G'')$, where $G''$ is an SCFG in Chomsky Normal
Form, which possibly has irrational rule probabilities, but such that for all strings $w \in \Sigma^*$ we have
$p_{G,w} = p_{G'',w}$. Furthermore, the $\delta$-approximation $G'$ of $G''$ is such that for all strings $w \in \Sigma^*$, such
that $|w| \le N$, we have: $|p_{G,w} - p_{G',w}| \le \delta$.

In other words, the P-time computed CNF form SCFG, $G'$, is a "good enough approximation"
of $G$ when it comes to the total parse probability of all strings up to the desired length, $N$, and $G'$
also $\delta$-approximates a CNF form SCFG, $G''$, with irrational rule probabilities, such that $G$ and $G''$
generate exactly the same probability distribution on finite strings. We emphasize however that
the length needs to be given in unary for this algorithm to run in P-time.
Let us note here again that Abney, McAllester, and Pereira [1] (Theorem 4) have established the
existence of a SCFG in CNF form that has exactly the same probability of generating any nonempty
string as the original SCFG. However, as they mention, their existence result is completely
nonconstructive, and yields no algorithm for computing or approximating such an SCFG. Of course,
we note again that any such SCFG may require irrational rule probabilities.

We shall thus show that the non-constructive result of [1] can be made entirely constructive,
and that in fact an approximate version can be carried out in P-time. Specifically, we show that 
an SCFG, G, can be put through a sequence of "constructive transformations", some of which we 
don't actually compute explicitly, because they involve irrational rule probabilities, which ultimately 
leads, firstly, to an SCFG (with irrational rule probabilities) which is in CNF form, and which has 
exactly the same probability of generating any string, and secondly, thereafter to an "approximate 
SCFG" which has approximately the same probability of generating any string up to a desired 
length N, and which can be computed in P-time from the original SCFG. 

To begin our series of SCFG transformations, let us first observe that an obvious adaptation of
the methods we used to prove Proposition 2.1, which showed that we can convert any MPS or PPS
into one which is in simple normal form (SNF), can be used to also convert any SCFG $G$ to one
that is also in a simple normal form (SNF). By definition, a SCFG is in SNF form if it only contains
the following four kinds of rules:

1. $A \xrightarrow{p} BC$, where $B$, $C$ are nonterminals in $V$.

2. $A \xrightarrow{p} B$, where $B$ is a nonterminal symbol.

3. $A \xrightarrow{p} a$, where $a \in \Sigma$ is a terminal symbol.

4. $A \xrightarrow{p} \epsilon$, where $\epsilon$ denotes the empty string.

Lemma B.5. Any SCFG, $G$, with rational rule probabilities, can be converted in P-time to a
SCFG, $G^{(1)}$, in SNF form such that $G^{(1)}$ has the same terminal symbols $\Sigma$ as $G$, such that $G^{(1)}$ has
rational rule probabilities, and such that $G^{(1)}$ generates exactly the same probability distribution on
finite strings in $\Sigma^*$, i.e., such that $p_{G^{(1)},w} = p_{G,w}$ for all strings $w \in \Sigma^*$, and (thus) also $G$ and
$G^{(1)}$ have the same probability of termination.
Furthermore, if the original grammar $G$ had no $\epsilon$-rules then the new SNF grammar $G^{(1)}$ will
also have no $\epsilon$-rules.

Likewise, if the grammar $G$ only has a single $\epsilon$-rule $S \xrightarrow{p} \epsilon$, where $S$ is the start nonterminal,
and where $S$ doesn't appear on the RHS of any rule in $G$, then the same would hold for $G^{(1)}$.

The proof is analogous to the proof of Proposition 2.1. It involves adding new auxiliary nonterminals
and new auxiliary rules, each having probability 1, in order to suitably "abbreviate" the
sequences of symbols $\gamma$ on the right hand side (RHS) of rules $A \xrightarrow{p} \gamma$, whenever $|\gamma| \ge 3$. We do this
repeatedly until all such RHSs, $\gamma$, have $|\gamma| \le 2$. To obtain the normal form, we may then also need
to introduce nonterminals that generate a single terminal symbol with probability 1. We leave the
rest of the proof as an easy exercise for the reader.
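One possible way to carry out this abbreviation is sketched below. It is only an illustration of the idea and not necessarily the construction intended in the proof; the rule encoding, the function name `to_snf`, and the auxiliary nonterminal names `<X1>`, `<T:a>` are assumptions, and those names are assumed not to clash with existing symbols.

```python
from fractions import Fraction


def to_snf(nonterminals, rules):
    """Convert rules (lhs, rhs, p) into SNF shapes: A->BC, A->B, A->a, A->epsilon."""
    nts = set(nonterminals)
    out, pending, fresh = [], list(rules), 0
    term_nt = {}  # terminal symbol -> auxiliary nonterminal generating it

    def nt_for(symbol):
        """Return a nonterminal standing for `symbol` (adds T_a -> a with prob. 1)."""
        if symbol in nts:
            return symbol
        if symbol not in term_nt:
            term_nt[symbol] = f"<T:{symbol}>"
            nts.add(term_nt[symbol])
            out.append((term_nt[symbol], (symbol,), Fraction(1)))
        return term_nt[symbol]

    while pending:
        lhs, rhs, p = pending.pop()
        if len(rhs) >= 3:
            # Abbreviate the tail of a long RHS by a new auxiliary nonterminal.
            fresh += 1
            aux = f"<X{fresh}>"
            nts.add(aux)
            pending.append((aux, tuple(rhs[1:]), Fraction(1)))
            rhs = (rhs[0], aux)
        if len(rhs) == 2:
            rhs = (nt_for(rhs[0]), nt_for(rhs[1]))
        out.append((lhs, tuple(rhs), p))
    return nts, out


g = [("S", ("a", "S", "b"), Fraction(1, 2)), ("S", (), Fraction(1, 2))]
print(to_snf({"S"}, g)[1])
```

Every auxiliary rule has probability 1, so each derivation of the original grammar corresponds to exactly one derivation of the new grammar with the same probability.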

Clearly, every SCFG in Chomsky Normal Form is in SNF form (but not the other
way around). We shall show that an SNF form SCFG can be "transformed" to CNF form,
albeit to a CNF SCFG which may possibly have irrational rule probabilities, and which we will
not actually compute. If however the SNF SCFG happens to contain no $\epsilon$-rules, then we shall see
that the resulting CNF SCFG has only rational rule probabilities, and can be computed exactly in
P-time.
We will use $E_G(A) := p^A_{G,\epsilon}$ to denote the probability that starting with the nonterminal $A$, the
grammar $G$ generates the empty string $\epsilon$, and we will use $NE_G(A) = 1 - E_G(A)$ to denote the
probability that nonterminal $A$ does not generate the empty string. When the grammar $G$ itself is
clear from the context, we will use $E(A)$ to denote $E_G(A)$ and $NE(A)$ to denote $NE_G(A)$.

Lemma B.6. 

1. There is a P-time algorithm that, given a SCFG, $G$, for each nonterminal $A$ of $G$ determines
whether $E(A) = 1$, i.e., $NE(A) = 0$, and likewise whether $E(A) = 0$, i.e., $NE(A) = 1$.

2. There is a P-time algorithm that, given an SCFG $G$, any nonterminal $A$ of $G$, and given
any rational $\delta' > 0$, computes a $\delta'$-approximation of $E(A)$ (and of $NE(A)$), i.e., computes a
rational value $v_E \in [0,1]$ such that $|E(A) - v_E| \le \delta'$. And thus, letting $v_{NE} := 1 - v_E$, we
also compute $v_{NE}$ such that $|NE(A) - v_{NE}| \le \delta'$.

Proof. Part (1.) of Lemma B.6 follows directly from the fact ([13]) that we can decide whether the
termination probability of a SCFG is $= 1$, or is $= 0$, in P-time.

Part (2.) of Lemma B.6 follows directly from our main result, Corollary 4.5, which says that
we can $\delta'$-approximate the termination probability of a SCFG in P-time.

To see why these hold, simply note that $E(A)$ is precisely the termination probability, starting
at nonterminal $A$, of a new SCFG obtained from $G$ by removing all rules that have any terminal
symbol $a \in \Sigma$ occurring on the RHS.$^7$ □
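The reduction used in this proof, together with the easier half of the qualitative test in part (1.), can be sketched as follows. We use the elementary fact that $E(A) > 0$ exactly when $A$ can derive $\epsilon$ using rules of positive probability, which is the standard "nullable" fixed point computation; this is not the algorithm of [13], and the test for $E(A) = 1$ is not shown. The function names and rule encoding are assumptions for illustration.

```python
from fractions import Fraction


def drop_terminal_rules(nonterminals, rules):
    """Keep only rules whose right-hand side contains no terminal symbol."""
    return [(lhs, rhs, p) for lhs, rhs, p in rules
            if all(s in nonterminals for s in rhs)]


def nonterminals_with_positive_E(nonterminals, rules):
    """Return {A : E(A) > 0}, i.e. the nonterminals that can derive epsilon."""
    usable = [r for r in drop_terminal_rules(nonterminals, rules) if r[2] > 0]
    nullable, changed = set(), True
    while changed:
        changed = False
        for lhs, rhs, _ in usable:
            # lhs derives epsilon if every symbol on some RHS already does.
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


rules = [
    ("S", ("A", "B"), Fraction(1)),
    ("A", (), Fraction(1, 2)),
    ("A", ("a",), Fraction(1, 2)),
    ("B", ("b",), Fraction(1)),
]
print(nonterminals_with_positive_E({"S", "A", "B"}, rules))
```

In this example only $A$ can generate the empty string, so $E(S) = E(B) = 0$.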

Consider a SCFG, $G^{(1)}$, in SNF form. As the next step of our transformation, we shall obtain
a new SNF SCFG, $G^{(2)}$, where we remove all nonterminals $A$ from the SCFG, such that
$E(A) = 1$. We do so as follows: first, compute in P-time whether $E(A) = 1$ for every nonterminal
$A$. If $E(A) = 1$, then remove all rules associated with $A$, i.e., all rules in $R_A$, and furthermore
remove every occurrence of $A$ from the RHS of any rule. In other words, if $\gamma$ is the right hand
side of some rule and $A$ occurs in $\gamma$, then remove those occurrences of $A$, and leave the remaining
symbols in their original order. If this results in an empty RHS of a rule, then the RHS becomes
the empty string $\epsilon$.

In the special case where $S$ is the start nonterminal of $G^{(1)}$ and $E(S) = 1$, the SCFG $G^{(1)}$
generates the empty string with probability 1, and in this case we make $G^{(2)}$ the trivial SCFG
consisting of only one rule: $S \xrightarrow{1} \epsilon$.
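The cleaning step just described can be sketched as follows, assuming the set of nonterminals $A$ with $E(A) = 1$ has already been computed (for instance via part (1.) of Lemma B.6). The function name `clean`, the parameter name `certain_empty`, and the rule encoding are hypothetical.

```python
from fractions import Fraction


def clean(rules, start, certain_empty):
    """Remove every nonterminal A with E(A) = 1 from an SNF grammar.

    rules: list of (lhs, rhs, p); certain_empty: the precomputed set {A : E(A) = 1}.
    """
    if start in certain_empty:
        return [(start, (), Fraction(1))]  # trivial grammar S -> epsilon
    cleaned = []
    for lhs, rhs, p in rules:
        if lhs in certain_empty:
            continue  # drop all rules in R_A
        # Delete occurrences of removed nonterminals; an empty RHS means epsilon.
        cleaned.append((lhs, tuple(s for s in rhs if s not in certain_empty), p))
    return cleaned


rules = [
    ("S", ("A", "B"), Fraction(1, 2)),
    ("S", ("a",), Fraction(1, 2)),
    ("A", (), Fraction(1)),
    ("B", ("b",), Fraction(1)),
]
print(clean(rules, "S", {"A"}))
```

Rule probabilities are left unchanged, and an SNF rule $A \xrightarrow{p} BC$ simply becomes a rule of the form $A \xrightarrow{p} C$, $A \xrightarrow{p} B$, or $A \xrightarrow{p} \epsilon$.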

Definition B.7. We call a SCFG, G, in SNF form cleaned if it contains no nonterminals A such 
that E(A) = 1, unless E(S) = 1 where S is the start nonterminal of G, in which case G is the 
trivial SCFG consisting of a single rule given by $S \xrightarrow{1} \epsilon$.

The above discussion establishes the following Lemma. 

Lemma B.8. (Cleaned SCFG: removal of trivial nonterminals) Given an input SCFG, $G^{(1)}$, in
SNF form, we can compute in P-time a cleaned SCFG in SNF form, $G^{(2)}$, such that for all strings
$w \in \Sigma^*$ we have $p_{G^{(1)},w} = p_{G^{(2)},w}$.

We are now ready for a critical step in our "transformation" which involves irrational probabil- 
ities. We will not actually compute this "transformation" exactly in our algorithms, but rather we 
will later do so "approximately" in an appropriate way. 

7 More precisely, we remove each such rule $B \xrightarrow{p} \gamma$, and in order to still maintain a proper SCFG, we add a new
"dead" rule $B \xrightarrow{p} NN$, where N is a new "dead" nonterminal symbol that has associated with it the rule $N \xrightarrow{1} NN$.
Lemma B.9. (Conditioned SCFG: removal of epsilon rules) Given any cleaned non-trivial SCFG,
$G^{(2)}$, in SNF form, there is an SCFG, $G^{(3)}$, which has the same terminals and nonterminals as
$G^{(2)}$, and which is also in SNF form, but which does not contain any $\epsilon$-rules, and such that for all
non-empty strings $w \in \Sigma^+$ and all nonterminals A we have:

$$p^A_{G^{(2)},w} = p^A_{G^{(3)},w} * NE_{G^{(2)}}(A)$$

The SCFG $G^{(3)}$ may contain rules with irrational probabilities, even if $G^{(2)}$ does not (see footnote 8).

According to this Lemma, for any cleaned non-trivial SCFG, $G^{(2)}$, and any non-empty string
w,

$$p^A_{G^{(3)},w} = \frac{p^A_{G^{(2)},w}}{NE_{G^{(2)}}(A)}$$

In other words, the probability of generating the non-empty string w in $G^{(3)}$ starting at nonterminal
A is precisely the conditional probability that $G^{(2)}$ generates the string w starting at nonterminal
A, conditioned on the event that $G^{(2)}$ does not generate the empty string starting at A. This is
why we call $G^{(3)}$ the "conditioned SCFG" for $G^{(2)}$ (see footnote 9).

Proof. Given $G^{(2)} = (V, \Sigma, R^{(2)}, S)$, we define the new SCFG, $G^{(3)} = (V, \Sigma, R^{(3)}, S)$, as follows.
Below, whenever we refer to E(A) or NE(A) for any nonterminal A, these are with respect to the
SCFG $G^{(2)}$, i.e., $E(A) := E_{G^{(2)}}(A)$ and $NE(A) := NE_{G^{(2)}}(A)$:

1. For each rule r of the form $A \xrightarrow{p} a$ in $R^{(2)}$, where $a \in \Sigma$ is a single terminal symbol, we put
into $R^{(3)}$ the following rule:

$$r' : A \xrightarrow{p/NE(A)} a$$

2. For each rule r of the form $A \xrightarrow{p} B$ in $R^{(2)}$, where $B \in V$ is a single nonterminal symbol, we
put into $R^{(3)}$ the following rule:

$$r' : A \xrightarrow{p * NE(B)/NE(A)} B$$

3. For each rule r of the form $A \xrightarrow{p} BC$ in $R^{(2)}$, where $B, C \in V$ are nonterminals, we put all of
the following three rules into $R^{(3)}$:

$$r'(1) : A \xrightarrow{p * NE(B) * NE(C)/NE(A)} BC$$



8 In fact, our proof establishes a more precise relationship between the parse trees of $G^{(2)}$ and $G^{(3)}$ and their
respective probabilities, but since we will not later use this stronger fact, we refrain from describing the precise
relationship within the statement of the Lemma.

9 Let us mention here that our proof of this Lemma is related in spirit to, but is quite different from, our proof in
[15] of a conditioned summary chain construction for Recursive Markov Chains.
$$r'(2) : A \xrightarrow{p * NE(B) * E(C)/NE(A)} B$$

$$r'(3) : A \xrightarrow{p * E(B) * NE(C)/NE(A)} C$$

We do not put any other rules into $R^{(3)}$. This completes the definition of $G^{(3)}$.

Notice that it is possible that the rule probability for some of these rules will be 0, because
E(B) and E(C) can be 0. In such a case, those rules have probability 0, meaning we can
simply remove them from $R^{(3)}$. Notice also that the rule probabilities for rules in $R^{(3)}$ are all
well-defined, because $G^{(2)}$ is a cleaned SCFG, and thus NE(A) > 0 for all nonterminals A.

Claim B.10. If $G^{(2)}$ is a proper SCFG, then so is $G^{(3)}$.

Proof. To see why this claim holds, observe that for every nonterminal A,

$$NE(A) = \sum_{r \in R_A} p(r) * (1 - p^{\gamma_r}_{G^{(2)},\epsilon})$$

where the sum is over all rules r associated with nonterminal A in $G^{(2)}$, $\gamma_r$ denotes the RHS of
rule r, and $p^{\gamma_r}_{G^{(2)},\epsilon}$ is the probability that $\gamma_r$ generates the empty string. In other words, the
probability that A does not generate the empty string in $G^{(2)}$ is equal to the weighted sum of the
probabilities that the RHSs $\gamma$ of rules associated with A do not generate the empty string. But
then note that:

1. For a rule of the form $A \xrightarrow{p} a$, the probability that the RHS a does not generate the empty
string is 1.

2. For a rule of the form $A \xrightarrow{p} B$, the probability that the RHS B does not generate the empty
string is NE(B).

3. For a rule of the form $A \xrightarrow{p} BC$, the probability that the RHS BC does not generate the
empty string is: NE(B) * NE(C) + E(B) * NE(C) + NE(B) * E(C). This is because we
need at least one of B or C to not generate the empty string, and whether each of them does
so or not is an independent event.

From this we see that if we sum the probabilities of the rules associated with A in $G^{(3)}$, assuming
that $G^{(2)}$ is proper, the numerators of these probabilities will sum up to NE(A), and thus, since all of them
have denominator NE(A), the SCFG $G^{(3)}$ is also proper. □
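The construction of $R^{(3)}$ and the properness argument can be checked on a small example. The sketch below assumes the exact values E(A) are available as rationals, which in general they are not (they may be irrational, and the algorithms later only approximate them). The encoding and the name `conditioned_rules` are hypothetical.

```python
from fractions import Fraction

def conditioned_rules(rules, E, terminals):
    """Build the rules of the conditioned SCFG from a cleaned SNF grammar.

    rules: dict nonterminal -> list of (prob, rhs), rhs a tuple of symbols.
    E: dict nonterminal -> probability of generating the empty string.
    """
    NE = {A: 1 - E[A] for A in E}
    out = {}
    for A, alts in rules.items():
        new = []
        for p, rhs in alts:
            if len(rhs) == 0:
                continue  # epsilon rules are not carried over
            if len(rhs) == 1 and rhs[0] in terminals:
                new.append((p / NE[A], rhs))                      # rule r'
            elif len(rhs) == 1:
                (B,) = rhs
                new.append((p * NE[B] / NE[A], (B,)))             # rule r'
            else:
                B, C = rhs
                new.append((p * NE[B] * NE[C] / NE[A], (B, C)))   # rule r'(1)
                new.append((p * NE[B] * E[C] / NE[A], (B,)))      # rule r'(2)
                new.append((p * E[B] * NE[C] / NE[A], (C,)))      # rule r'(3)
        out[A] = [(q, rhs) for q, rhs in new if q > 0]  # drop probability-0 rules
    return out

# Hypothetical cleaned SNF grammar with E(A) = E(B) = 1/2 and E(S) = 1/8.
rules2 = {
    "S": [(Fraction(1, 2), ("A", "B")), (Fraction(1, 2), ("a",))],
    "A": [(Fraction(1, 2), ()), (Fraction(1, 2), ("a",))],
    "B": [(Fraction(1, 2), ("b",)), (Fraction(1, 2), ())],
}
E = {"S": Fraction(1, 8), "A": Fraction(1, 2), "B": Fraction(1, 2)}
rules3 = conditioned_rules(rules2, E, terminals={"a", "b"})
```

In this example the rule probabilities of every nonterminal of the conditioned grammar sum to exactly 1, as the claim asserts. The string a has probability 5/8 from S in the original grammar and 5/7 = (5/8)/(7/8) in the conditioned one.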

We next have the key claim: 

Claim B.11. For any nonterminal A, and for all non-empty strings $w \in \Sigma^+$, we have:
$p^A_{G^{(2)},w} = p^A_{G^{(3)},w} * NE_{G^{(2)}}(A)$.

Proof. We shall prove this key claim as follows. For every non-empty string $w \in \Sigma^+$, and every
nonterminal A, we will define a mapping, $g_{w,A}$, from finite parse trees $t \in T^A_{G^{(2)},w}$ to finite parse
trees $g_{w,A}(t) \in T^A_{G^{(3)},w}$. We shall establish that the mapping $g_{w,A}$ has the following properties:

(1.) The mapping $g_{w,A}$ is well-defined, meaning that if $T^A_{G^{(2)},w} \neq \emptyset$, then for any parse tree
$t \in T^A_{G^{(2)},w}$ we have $g_{w,A}(t) \in T^A_{G^{(3)},w}$.

(2.) The mapping $g_{w,A}$ is onto, meaning if $T^A_{G^{(3)},w} \neq \emptyset$, then for any tree $t' \in T^A_{G^{(3)},w}$ we have
$g^{-1}_{w,A}(t') \neq \emptyset$.

(3.) Finally, the following equality holds for all parse trees $t' \in T^A_{G^{(3)},w}$:

$$P_{G^{(3)},A}(t') * NE_{G^{(2)}}(A) = \sum_{t \in g^{-1}_{w,A}(t')} P_{G^{(2)},A}(t) \qquad (15)$$

In other words, the probability of parse tree t' of w rooted at A in $G^{(3)}$, times the probability
that the nonterminal A does not generate the empty string $\epsilon$ in $G^{(2)}$, is the same as the sum
of the probabilities of all parse trees t of w in $G^{(2)}$ rooted at A that are mapped to t' by $g_{w,A}$.

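Once we establish the above three properties for the mapping $g_{w,A}$, which we shall define shortly,
Claim B.11 follows basically immediately, because:

$$
\begin{aligned}
p^A_{G^{(2)},w} &= \sum_{t \in T^A_{G^{(2)},w}} P_{G^{(2)},A}(t) && \text{(by the definition of } p^A_{G^{(2)},w}\text{)} \\
&= \sum_{t' \in T^A_{G^{(3)},w}} \; \sum_{t \in g^{-1}_{w,A}(t')} P_{G^{(2)},A}(t) && \text{(by properties (1.) and (2.))} \\
&= \sum_{t' \in T^A_{G^{(3)},w}} P_{G^{(3)},A}(t') * NE_{G^{(2)}}(A) && \text{(by equation (15))} \\
&= \Big( \sum_{t' \in T^A_{G^{(3)},w}} P_{G^{(3)},A}(t') \Big) * NE_{G^{(2)}}(A) \\
&= p^A_{G^{(3)},w} * NE_{G^{(2)}}(A) && \text{(by the definition of } p^A_{G^{(3)},w}\text{)}
\end{aligned}
$$

It now remains to define $g_{w,A}$, and then to establish properties (1.)-(3.).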

Given a parse tree $t \in T^A_{G^{(2)},w}$, we define $g_{w,A}(t)$ via a simple kind of "pruning" of t, as follows.
Let us call a subtree t* of t an $\epsilon$-maximal subtree if, firstly, all leaves of t* are labeled by $\epsilon$, and
secondly, either t* is t, or else it is not the case that all leaves of the subtree rooted at the immediate
parent of the root of t*, within t, are also labeled by $\epsilon$. So, $\epsilon$-maximal subtrees are maximal sub-parse-trees
of t that generate the empty string.

We shall define $g_{w,A}(t)$ to be the "pruning" of t obtained by removing all $\epsilon$-maximal subtrees
of t. We do not replace the removed subtrees by anything. To be precise, when we remove one of
the ordered children of an internal node of t, and the subtree rooted at that child, we retain the
relative ordering of the other children with respect to each other.

Firstly, note that $g_{w,A}(t)$ will indeed retain the root node of t, labeled A. This is because t is a
parse tree of the non-empty string w, and thus it cannot be the case that all leaves of t are labeled
by $\epsilon$.
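The pruning itself is easy to sketch. The code below handles only the shape of the tree and the node labels, not the rule labels; it assumes a hypothetical encoding of a parse tree as nested `(label, children)` pairs, with the marker `EPS` labeling $\epsilon$-leaves.

```python
EPS = "ε"  # label used for leaves that carry the empty string

def yields_empty(node):
    """True if every leaf below (and including) this node is labeled EPS."""
    label, children = node
    if not children:
        return label == EPS
    return all(yields_empty(c) for c in children)

def prune(node):
    """Remove all epsilon-maximal subtrees; assumes the root does not yield EPS."""
    label, children = node
    # A child that yields the empty string, under a parent that does not, is
    # exactly the root of an epsilon-maximal subtree. The order of the
    # remaining children is preserved.
    kept = [prune(c) for c in children if not yields_empty(c)]
    return (label, kept)

def tree_yield(node):
    """The string generated by the tree, as a list of terminal symbols."""
    label, children = node
    if not children:
        return [] if label == EPS else [label]
    return [s for c in children for s in tree_yield(c)]

# Hypothetical parse tree of the string "b": S -> A B, A -> C D -> eps eps, B -> b.
t = ("S", [("A", [("C", [(EPS, [])]), ("D", [(EPS, [])])]), ("B", [("b", [])])])
pruned = prune(t)
```

Here the whole subtree rooted at A is removed in one step, and the pruned tree still generates the same string as t.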

Our definition of $g_{w,A}(t)$ is not yet complete. In more detail, we have to define all labels,
including rule labels, of all nodes in $g_{w,A}(t)$. We do so as follows.

First note that every node z of t that is retained in $g_{w,A}(t)$ is a leaf in
$g_{w,A}(t)$ if and only if it was already a leaf in t.

For every leaf node z of $g_{w,A}(t)$ we retain exactly the same label, $L_1(z) = a \in \Sigma$, which was its
label in t.

For every internal node z of $g_{w,A}(t)$, we retain exactly the same nonterminal label, $L_1(z) = B$,
labeling the corresponding node z in t. Furthermore, if in t the node z was labeled by a rule
$r = L_2(z)$, then we do as follows:

1. If the rule r is of the form $B \xrightarrow{p} a$, for some terminal symbol $a \in \Sigma$, then in $g_{w,A}(t)$ we let
$L_2(z)$ be the corresponding rule r' given by $B \xrightarrow{p/NE(B)} a$.

2. If the rule r is of the form $B \xrightarrow{p} C$, for some nonterminal symbol C, then in $g_{w,A}(t)$ we let
$L_2(z)$ be the corresponding rule r' specified by $B \xrightarrow{p * NE(C)/NE(B)} C$.

3. If the rule r is of the form $B \xrightarrow{p} CD$ for nonterminal symbols C and D, then in $g_{w,A}(t)$ we
shall assign $L_2(z)$ one of the three corresponding rules r'(1), r'(2), or r'(3), based on the
following:

If in t neither child of z was the root of an $\epsilon$-maximal subtree, then in $g_{w,A}(t)$ we let $L_2(z) := r'(1)$. If
in t the right child of z was the root of an $\epsilon$-maximal subtree, then in $g_{w,A}(t)$ we let $L_2(z) := r'(2)$. If in
t the left child of z was the root of an $\epsilon$-maximal subtree, then in $g_{w,A}(t)$ we let $L_2(z) := r'(3)$.

The reader can easily confirm that these are the only possibilities, since otherwise the node z
would have been "pruned out", by the definition of the tree defining $g_{w,A}(t)$. So this mapping of
rules to nodes of $g_{w,A}(t)$ is well-defined, i.e., $g_{w,A}(t) \in T^A_{G^{(3)},w}$. Indeed, consider the parent node, z',
of the root of an $\epsilon$-maximal subtree, t*, of t, and suppose that node z' is labeled by a nonterminal
A'. Note that the rule r associated with the parent node, z', in t must be of the form $A' \xrightarrow{p} B'C'$,
where B' and C' are nonterminals, and one of them, say B' w.l.o.g., is the label of the root of the
$\epsilon$-maximal subtree t*. This is because if the rule associated with z' was a linear rule of the form
$A' \xrightarrow{p} B'$, then t* would not be an $\epsilon$-maximal subtree in t. Thus if t* is rooted at the left child,
labeled B', of node z', by construction, the rules of $R^{(3)}$ will include the rule r'(3) given by:

$$A' \xrightarrow{p * E(B') * NE(C')/NE(A')} C'$$

and we have defined the parse tree $g_{w,A}(t)$ so that it uses this rule of $R^{(3)}$ at the node z'. In other
words, we have let $L_2(z') = r'(3)$. Similarly, if t* is rooted at the right child of z' labeled by C', we
have made $g_{w,A}(t)$ use the rule

$$A' \xrightarrow{p * NE(B') * E(C')/NE(A')} B'$$

which exists by definition of $R^{(3)}$. Thus $g_{w,A}(t)$, as defined, is a parse tree of $G^{(3)}$ and is clearly a
parse tree of the string w. We have thus established (1.), i.e., that indeed $g_{w,A}(t) \in T^A_{G^{(3)},w}$.

Next we establish (2.), that the mapping $g_{w,A}$ is onto. Suppose $T^A_{G^{(3)},w} \neq \emptyset$, and suppose that
$t' \in T^A_{G^{(3)},w}$. If t' does not contain any internal node labeled with a rule of the types r'(2) or r'(3),
then it is easy to see that exactly the same parse tree, where we replace every rule label r' or r'(1)
by their corresponding version r in $G^{(2)}$, is indeed a parse tree $t \in T^A_{G^{(2)},w}$ such that $g_{w,A}(t) = t'$.

If on the other hand t' does have some internal node labeled with a rule of the type r'(2) or
r'(3), then without loss of generality (by symmetric arguments) suppose it is a rule of the form
r'(2) given by $A' \xrightarrow{p * NE(B') * E(C')/NE(A')} B'$. Note that it must be the case that E(C') > 0 (otherwise
t' has probability 0 and thus is not a parse tree in $G^{(3)}$), and thus there is some parse tree rooted
at C' which generates the empty string $\epsilon$.

Thus, for all such rules of the form r'(2) labeling a node z of t', we will be able to convert
the rule at the corresponding node z of a parse tree t of $G^{(2)}$ to the original rule r of the form
$A' \xrightarrow{p} B'C'$ from which r'(2) was generated, and then we can add any sub-parse tree for the empty
string $\epsilon$, rooted at the child of z in t labeled by nonterminal C'. We can also obviously do the
symmetric thing for nodes labeled by rules r'(3) in t'. In this way, we will have constructed a tree
$t \in T^A_{G^{(2)},w}$ such that $g_{w,A}(t) = t'$. This establishes property (2.), namely that $g_{w,A}$ is onto.

Finally, we have to establish the key property (3.), namely that for every parse tree $t' \in T^A_{G^{(3)},w}$,
we have:

$$P_{G^{(3)},A}(t') * NE_{G^{(2)}}(A) = \sum_{t \in g^{-1}_{w,A}(t')} P_{G^{(2)},A}(t)$$

The key to establishing this equality is the following inductive claim. Let us define a mapping
h from rules of $G^{(3)}$ back to their "corresponding" rule in $G^{(2)}$. Specifically, for every rule r' of $G^{(3)}$
we see easily that by our definition of $R^{(3)}$ this rule was generated directly from a "corresponding"
rule r in $R^{(2)}$. We simply define h(r') := r.

We extend this mapping h to a tree $t' \in T^A_{G^{(3)},w}$, by defining h(t') to be the multi-set of rules in
$R^{(2)}$ that arise by mapping back the rule label $L_2(z)$ of every internal node z in t' to its corresponding
rule $h(L_2(z))$ in $R^{(2)}$. It is important that h(t') is a multi-set, i.e., that it retains k copies of the
same rule r if there are k nodes z of t' for which $h(L_2(z)) = r$.

We need some more definitions. For a tree $t' \in T^A_{G^{(3)},w}$, let us define two other multi-sets of
rules in $G^{(2)}$, namely, $Z_{t',2}$ and $Z_{t',3}$, where $Z_{t',2}$ is a multi-set of rules in $R^{(2)}$ containing one copy
of a rule $r \in R^{(2)}$ for every instance of the corresponding rule r'(2) that labels some node z of
t'. Similarly, $Z_{t',3}$ is a multi-set containing one copy of a rule $r \in R^{(2)}$ for every instance of the
corresponding rule r'(3) that labels some node z of t'. Notice that all rules in the multi-sets $Z_{t',2}$
and $Z_{t',3}$ are of the form $A' \xrightarrow{p} B'C'$.

Let us define the following multi-sets corresponding to $Z_{t',2}$ and $Z_{t',3}$. Namely, let $K_{t',2}$ be the
multi-set of nonterminals in $G^{(2)}$ defined by taking every rule instance $r \in Z_{t',2}$ and if r has the form
$A' \xrightarrow{p} B'C'$, then adding a copy of C' to $K_{t',2}$. Likewise let $K_{t',3}$ be the multi-set of nonterminals in
$G^{(2)}$ defined by taking every rule instance $r \in Z_{t',3}$ and if r has the form $A' \xrightarrow{p} B'C'$, then adding
a copy of B' to $K_{t',3}$.

We are now ready to state and prove a key claim. 

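Claim B.12. For every parse tree $t' \in T^A_{G^{(3)},w}$, we have

$$P_{G^{(3)},A}(t') = \frac{\left(\prod_{r \in h(t')} p(r)\right) * \left(\prod_{C' \in K_{t',2}} E(C')\right) * \left(\prod_{B' \in K_{t',3}} E(B')\right)}{NE(A)} \qquad (16)$$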

Note that the products are indexed over multi-sets, not sets. 
Proof. We prove this claim by induction on the depth of the parse tree t'. 
For the base case, if the parse tree t' has depth 1, then it has only one internal node which is
the root, and that root is labeled by a rule r' of the form:

$$A \xrightarrow{p/NE(A)} a$$

Thus t' is a parse tree of the string w = a, and $P_{G^{(3)},A}(t') = p/NE(A)$. But since $K_{t',2} = K_{t',3} = \emptyset$,
we see that the right hand side of equation (16) is also equal to p(r)/NE(A).

Inductively, suppose that t' has depth $\geq 2$. There are different cases to consider, based on the
rule labeling the root of t'.

1. Suppose that the root z of t' is labeled by a rule $L_2(z) = r'$ which has the form:

$$A \xrightarrow{p(r) * NE(B)/NE(A)} B$$

and that $h(r') = r \in R^{(2)}$, where rule r has the form $A \xrightarrow{p(r)} B$.

Thus the root z of t' has only one child node in t', call it z*. Let $t^* \in T^B_{G^{(3)},w}$ denote the parse
subtree of t' rooted at z*. We know that $L_1(z^*) = B$, and by inductive assumption we know
that

$$P_{G^{(3)},B}(t^*) = \frac{\left(\prod_{r \in h(t^*)} p(r)\right) * \left(\prod_{C' \in K_{t^*,2}} E(C')\right) * \left(\prod_{B' \in K_{t^*,3}} E(B')\right)}{NE(B)}$$

But note that $P_{G^{(3)},A}(t') = P_{G^{(3)},B}(t^*) * p(r')$, and by multiplying and canceling, we get

$$P_{G^{(3)},A}(t') = P_{G^{(3)},B}(t^*) * (p(r) * NE(B)/NE(A)) = \frac{\left(\prod_{r \in h(t')} p(r)\right) * \left(\prod_{C' \in K_{t',2}} E(C')\right) * \left(\prod_{B' \in K_{t',3}} E(B')\right)}{NE(A)}$$

That completes the induction in this case.

2. Suppose that the root z of t' is labeled by a rule $L_2(z) = r'(1)$ which has the form:

$$A \xrightarrow{p(r) * NE(B_1) * NE(B_2)/NE(A)} B_1 B_2$$

and that $h(r'(1)) = r \in R^{(2)}$, such that rule r has the form $A \xrightarrow{p(r)} B_1 B_2$.

In this case, the root z of t' has two children, a left child $z_1$ and a right child $z_2$. Let
$t_1 \in T^{B_1}_{G^{(3)},w_1}$ and $t_2 \in T^{B_2}_{G^{(3)},w_2}$ be the two parse trees rooted at $z_1$ and $z_2$ respectively. Clearly
we must have $w = w_1 w_2$.

We know that $L_1(z_1) = B_1$ and $L_1(z_2) = B_2$. Moreover, by inductive assumption, we know
that for i = 1, 2, we have

$$P_{G^{(3)},B_i}(t_i) = \frac{\left(\prod_{r \in h(t_i)} p(r)\right) * \left(\prod_{C' \in K_{t_i,2}} E(C')\right) * \left(\prod_{B' \in K_{t_i,3}} E(B')\right)}{NE(B_i)}$$

Note again that $P_{G^{(3)},A}(t') = P_{G^{(3)},B_1}(t_1) * P_{G^{(3)},B_2}(t_2) * p(r'(1))$.

Again, by multiplying and canceling, we get

$$P_{G^{(3)},A}(t') = P_{G^{(3)},B_1}(t_1) * P_{G^{(3)},B_2}(t_2) * (p(r) * NE(B_1) * NE(B_2)/NE(A)) = \frac{\left(\prod_{r \in h(t')} p(r)\right) * \left(\prod_{C' \in K_{t',2}} E(C')\right) * \left(\prod_{B' \in K_{t',3}} E(B')\right)}{NE(A)}$$

This establishes the inductive claim in this case.
3. Suppose that the root z of t' is labeled by a rule $L_2(z) = r'(2)$ which has the form:

$$A \xrightarrow{p(r) * NE(B_1) * E(B_2)/NE(A)} B_1$$

and that $h(r'(2)) = r \in R^{(2)}$, such that rule r has the form $A \xrightarrow{p(r)} B_1 B_2$.

In this case, the root z of t' has one child, $z_1$. Let $t_1 \in T^{B_1}_{G^{(3)},w}$ be the parse tree rooted at $z_1$.

We know that $L_1(z_1) = B_1$. Moreover, by inductive assumption, we know that

$$P_{G^{(3)},B_1}(t_1) = \frac{\left(\prod_{r \in h(t_1)} p(r)\right) * \left(\prod_{C' \in K_{t_1,2}} E(C')\right) * \left(\prod_{B' \in K_{t_1,3}} E(B')\right)}{NE(B_1)}$$

Note again that $P_{G^{(3)},A}(t') = P_{G^{(3)},B_1}(t_1) * p(r'(2))$.

Observe that the multi-set $K_{t',2}$ consists of $K_{t_1,2} \cup \{B_2\}$, where the union here denotes a
multi-set union, so it contains an added copy of $B_2$. Thus, by multiplying and canceling, we
get

$$P_{G^{(3)},A}(t') = P_{G^{(3)},B_1}(t_1) * (p(r) * NE(B_1) * E(B_2)/NE(A)) = \frac{\left(\prod_{r \in h(t')} p(r)\right) * \left(\prod_{C' \in K_{t',2}} E(C')\right) * \left(\prod_{B' \in K_{t',3}} E(B')\right)}{NE(A)}$$

This establishes the inductive claim in this case.
4. Suppose that the root z of t' is labeled by a rule $L_2(z) = r'(3)$ which has the form:

$$A \xrightarrow{p(r) * E(B_1) * NE(B_2)/NE(A)} B_2$$

and that $h(r'(3)) = r \in R^{(2)}$, such that rule r has the form $A \xrightarrow{p(r)} B_1 B_2$.

This case is entirely analogous (and symmetric) to the previous one, and thus an identical
argument shows that the inductive claim holds also in this case.

This completes the inductive proof of the claim, since we have considered all possible rule types 
that can label the root node of t'. 

□ 



We now use Claim B.12 to show that property (3.) holds for the mapping $g_{w,A}$.
Consider a parse tree $t' \in T^A_{G^{(3)},w}$. Claim B.12 tells us that

$$P_{G^{(3)},A}(t') = \frac{\left(\prod_{r \in h(t')} p(r)\right) * \left(\prod_{C' \in K_{t',2}} E(C')\right) * \left(\prod_{B' \in K_{t',3}} E(B')\right)}{NE(A)}$$

Note that for any nonterminal B, E(B) is the sum of the probabilities of all distinct parse trees
rooted at B which generate the empty string $\epsilon$. Let ET(B) be the set of all these parse trees that
generate $\epsilon$ from B.

Let us now consider the probability of parse trees in $g^{-1}_{w,A}(t') \subseteq T^A_{G^{(2)},w}$. Note that each such
parse tree $t \in g^{-1}_{w,A}(t')$ can be specified by specifying how t has "expanded" every nonterminal in the
multi-sets $K_{t',2}$ and $K_{t',3}$ into parse trees of the string $\epsilon$. Specifically, t is determined by specifying
for each occurrence of each nonterminal B in both $K_{t',2}$ and $K_{t',3}$, which parse tree of ET(B) is
used to expand B into a parse tree for $\epsilon$. Then the probability of t is given by the product of
$\left(\prod_{r \in h(t')} p(r)\right)$ multiplied by the product of the probabilities of all the parse trees of $\epsilon$ chosen to expand every
nonterminal occurrence in $K_{t',2}$ and in $K_{t',3}$.

But then since for every nonterminal B, E(B) is the sum of the probabilities of all distinct parse
trees rooted at B which generate $\epsilon$, we can see that, by summing over all parse trees $t \in g^{-1}_{w,A}(t')$,
and then collecting like terms, we get

$$\sum_{t \in g^{-1}_{w,A}(t')} P_{G^{(2)},A}(t) = \left(\prod_{r \in h(t')} p(r)\right) * \left(\prod_{C' \in K_{t',2}} E(C')\right) * \left(\prod_{B' \in K_{t',3}} E(B')\right)$$

But then by Claim B.12 the identity (15) follows, and thus we have established property (3.)
of the mapping $g_{w,A}$, which is the last thing we needed to establish to complete the proof of Claim
B.11. □

This completes the proof of Lemma B.9, and establishes the correctness of the quantitative
properties it asserts for the conditioned SCFG, $G^{(3)}$, which is in SNF form, and which furthermore
contains no $\epsilon$-rules. □

Next we show how to "transform" $G^{(3)}$ to get rid of the "linear" rules of the form $A \xrightarrow{p} B$, and
thus obtain a CNF form SCFG.

Lemma B.13. Given any SCFG, $G^{(3)}$, which is in SNF form, and which contains no $\epsilon$-rules,
there is an SCFG $G^{(4)}$ in Chomsky-Normal-Form (CNF), such that $G^{(4)}$ has the same terminals
and nonterminals as $G^{(3)}$ and such that for all nonterminals A and all strings $w \in \Sigma^*$, we have
$p^A_{G^{(3)},w} = p^A_{G^{(4)},w}$.

Furthermore, if $G^{(3)}$ has only rational rule probabilities, then the transformation from $G^{(3)}$ to
$G^{(4)}$ is effective and efficient, in the sense that $G^{(4)}$ also has only rational rule probabilities, and
can be computed in P-time from $G^{(3)}$. When $G^{(3)}$ has irrational rule probabilities, then $G^{(4)}$
still exists but may require irrational rule probabilities.

Proof. The only thing we need to do in order to obtain $G^{(4)}$ from $G^{(3)}$ is to eliminate "linear" rules
of the form $A \xrightarrow{p} B$, where A and B are nonterminals.

This is easy to do by solving a suitable system of linear equations whose coefficients are taken
from rule probabilities in the grammar $G^{(3)}$.

Specifically, consider the following finite-state Markov chain, M, whose states consist of the set
of nonterminals of $G^{(3)}$ as well as the set of all distinct "right-hand sides", $\gamma$, that appear in any
rule $A \xrightarrow{p} \gamma$ of $G^{(3)}$, and where $\gamma$ is not a single nonterminal.

The probabilistic transitions of M consist of the rules of $G^{(3)}$. In other words, if there is a
grammar rule $A \xrightarrow{p} \gamma$, then there is a probabilistic transition $(A, p, \gamma)$ in M. Note that we can
assume $G^{(3)}$ is a proper SCFG, so this defines all the transitions out of nonterminal states of M.
Finally, for every state $\gamma$ that is not a single nonterminal, we add an absorbing self-loop transition
$(\gamma, 1, \gamma)$ to M.

Consider any RHS, $\gamma$, which is not a single nonterminal. Let $q^*_{A,\gamma}$ denote the probability that,
in the finite-state Markov chain, M, starting at state A, we eventually hit the state $\gamma$.



We assume, as always, that G is proper, and it is easy to check that all our transformations maintain the
properness of the SCFG.
Using these hitting probabilities we can easily eliminate the linear rules of $G^{(3)}$. We "construct"
$G^{(4)}$ as follows. In $G^{(4)}$ we remove all linear rules from $G^{(3)}$, and for every nonterminal A and RHS
$\gamma$, where $\gamma$ is not a single nonterminal, if $q^*_{A,\gamma} > 0$, then add the rule $A \xrightarrow{q^*_{A,\gamma}} \gamma$ to $G^{(4)}$.

To maintain the properness of the grammar, if the sum of the probabilities of the rules of a
nonterminal A is less than 1, we add as usual a rule $A \to NN$ with the remaining probability, where
N is a dead nonterminal.

It is easy to see that the total probability $p^A_{G^{(4)},w}$ of generating any particular string w in
$G^{(4)}$ starting at nonterminal A remains the same as the total probability of generating w starting
at A in $G^{(3)}$.

It only remains to show that if the rule probabilities of $G^{(3)}$ are rational, then we can compute the
hitting probabilities $q^*_{A,\gamma}$ in polynomial time. But it is a well known fact that hitting probabilities
can be obtained by solving a corresponding system of linear equations.

Specifically, consider a RHS, $\gamma$, which is not itself a single nonterminal. We can easily determine
the set of nonterminals A for which the hitting probability $q^*_{A,\gamma}$ is positive. This is the case if
and only if in the underlying graph of the Markov chain M there is a path from the state A to the
state $\gamma$.

Suppose there are n distinct nonterminals A in $G^{(3)}$ such that $q^*_{A,\gamma} > 0$. Let us index these n
nonterminals as: $A_1, \ldots, A_n$. Let P denote the $n \times n$ substochastic matrix whose (i, j)'th entry
$P_{i,j}$ is the one-step transition probability from state $A_i$ to state $A_j$ in the Markov chain M. Let
the column n-vector $b^{\gamma}$ be defined as follows: $b^{\gamma}_i$ is the one-step transition probability from state
$A_i$ to state $\gamma$ in M. Then if we let the column n-vector x of variables, $x_i$, represent the unknown
hitting probabilities, $q^*_{A_i,\gamma}$, we have the following linear system of equations:

$$x = Px + b^{\gamma}$$

which is equivalent to the linear system of equations

$$(I - P)x = b^{\gamma} \qquad (17)$$

Clearly, letting $x_i = q^*_{A_i,\gamma}$ is one solution to this equation. Moreover, since P represents the
transition submatrix of all the transient states within a finite-state Markov chain, it follows from
standard facts (see, e.g., [5], Lemma 8.3.20) that $\rho(P) < 1$, where $\rho(P)$ denotes the spectral radius
of the substochastic matrix P. It thus follows from Lemma A.1 that the matrix $(I - P)$ is non-singular,
and that $(I - P)^{-1} = \sum_{k=0}^{\infty} P^k$. Therefore there is a unique solution vector $x^* = (I - P)^{-1} b^{\gamma}$
for the system of linear equations in (17), where $x^*_i = q^*_{A_i,\gamma}$ are precisely the hitting probabilities
for every i.

Thus, if $G^{(3)}$ has only rational rule probabilities, then we can compute $G^{(4)}$ in P-time by solving
one such system of linear equations for each RHS, $\gamma$, of a rule in $G^{(3)}$, which is not itself a single
nonterminal.

In fact, even when $G^{(3)}$ contains irrational rule probabilities, we will later use the linear system
of equations (17) in important ways in our approximability analysis. □
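For rational rule probabilities, solving system (17) exactly can be sketched as follows. The sketch uses straightforward Gauss-Jordan elimination over Python's `Fraction` type, which is one possible way to solve the system and is not claimed to be the method intended by the proof. The function names and the two-nonterminal example are hypothetical, and the caller is assumed to have already restricted P to the nonterminals with positive hitting probability, so that I - P is non-singular.

```python
from fractions import Fraction

def solve_linear(M, rhs):
    """Solve M x = rhs exactly by Gauss-Jordan elimination; M must be non-singular."""
    n = len(M)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = 1 / aug[col][col]
        aug[col] = [v * inv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

def hitting_probabilities(P, b):
    """Return x with (I - P) x = b, i.e. the hitting probabilities of one RHS."""
    n = len(P)
    i_minus_p = [[(1 if i == j else 0) - P[i][j] for j in range(n)] for i in range(n)]
    return solve_linear(i_minus_p, b)

# Hypothetical grammar: A -> B (1/2) | a (1/2),  B -> A (1/3) | b (2/3).
P = [[Fraction(0), Fraction(1, 2)],
     [Fraction(1, 3), Fraction(0)]]
hit_a = hitting_probabilities(P, [Fraction(1, 2), Fraction(0)])   # RHS "a"
hit_b = hitting_probabilities(P, [Fraction(0), Fraction(2, 3)])   # RHS "b"
```

In the example, the new rules for A would be A -> a with probability 3/5 and A -> b with probability 2/5, which again sum to 1.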

Lemma B.13 allows us to finally "obtain" a SCFG $G^{(4)}$ which is in CNF form, starting from our
original SCFG, G, via a sequence of transformations. Unfortunately, in the process of obtaining
$G^{(4)}$, some of our transformations required possibly introducing grammar rules with irrational rule
probabilities. We now show that we can nevertheless efficiently compute a suitable approximation
to $G^{(4)}$. The first step toward this is to establish the following Lemma.
For any SCFG, G, let $G^{(i)}$, i = 1, ..., 4, denote the SCFG obtained from G via the sequence
of transformations described in Lemmas B.5 to B.13. In general, $G^{(i)}$ may have irrational rule
probabilities, even when G does not. Nevertheless, we shall show that, given G and $\delta > 0$, we can
compute in P-time a SCFG $G^{(i)}_{\delta} \in B_{\delta}(G^{(i)})$, for i = 1, ..., 4. First:

Lemma B.14. There is a polynomial-time algorithm that, given any proper SCFG G with rational
rule probabilities, and given any rational value $\delta > 0$ in standard binary representation, computes a
proper SCFG, $G^{(3)}_{\delta} \in B_{\delta}(G^{(3)})$, with rational rule probabilities. In other words, given G and $\delta > 0$,
we can compute in P-time a $\delta$-approximation of $G^{(3)}$.

Proof. To prove this lemma, we will show that every step of our transformations beginning with G
and resulting in the SCFG $G^{(3)}$ can be carried out either exactly or approximately in P-time.

$G \to G^{(1)} \to G^{(2)}$: It was already established in Lemma B.5 and Lemma B.8 that we can carry out
these first two steps of the transformation exactly in P-time. Specifically, given any SCFG, G, with
rational rule probabilities, we can construct in P-time a cleaned SCFG, $G^{(2)}$, in SNF form with
rational rule probabilities such that, in particular, for all strings $w \in \Sigma^*$, we have $p_{G,w} = p_{G^{(2)},w}$.

Furthermore, we can assume that $G^{(2)}$ is nontrivial, meaning that the start nonterminal does
not generate the empty string with probability 1, because if this was the case, then we know the
transformation would have computed as $G^{(2)}$ the trivial CNF SCFG consisting of only the single
rule $S \xrightarrow{1} \epsilon$. In that case we would be done, so we assume w.l.o.g. that the result was not this
trivial SCFG.

$G^{(2)} \to G^{(3)}$: Recall that the key transformation $G^{(2)} \to G^{(3)}$ may introduce irrational rule
probabilities into the "conditioned SCFG", $G^{(3)}$. Given $G^{(2)}$, and given $\delta > 0$, we now show how to
compute in P-time a proper SCFG $G^{(3)}_{\delta} \in B_{\delta}(G^{(3)})$.

To do this, we make crucial use of our P-time approximation algorithm for termination
probabilities of an SCFG, and in particular its corollary, Lemma B.6, which tells us that given an
SCFG $G^{(2)}$, for each nonterminal A, we can determine in P-time whether E(A) = 0 or NE(A) = 0,
or whether E(A) = 1 or NE(A) = 1, and given any $\delta' > 0$, we can in P-time approximate E(A)
and NE(A) within distance $\delta'$, i.e., we can compute rational values $v^{E} \in [0,1]$ and $v^{NE} \in [0,1]$
such that $|E(A) - v^{E}| < \delta'$, and $|NE(A) - v^{NE}| < \delta'$. We will see how to choose $\delta'$ shortly.

Note that the probability p(r') > 0 of a rule r' in $G^{(3)}$ can only have one of several possible
forms. In each case, p(r') is given by an expression whose denominator is $NE(A) = NE_{G^{(2)}}(A)$ for
some nonterminal A of $G^{(2)}$.

Note that E(A) is precisely the termination probability, starting at nonterminal A, of a new
SCFG obtained from $G^{(2)}$ by removing all rules that have any terminal symbol $a \in \Sigma$ occurring
on the RHS. It thus follows that NE(A) is the non-termination probability starting at A for that
SCFG. By Lemma B.6 we can determine in P-time whether NE(A) = 1, and by construction
of $G^{(2)}$ we know $NE(A) \neq 0$. Since termination probabilities of an SCFG are the LFP, $q^*$, of
a corresponding PPS, x = P(x), which has the same encoding size as the SCFG, we can conclude from
our earlier theorem bounding $1 - q^*_i$ from below that for all nonterminals A, where $NE(A) \neq 1$, $NE(A) \geq 1/2^{4|G^{(2)}|}$. Let us define
$\zeta = 1/2^{4|G^{(2)}|}$.

For a fixed nonterminal A, since every rule r' of $G^{(3)}$ associated with A has an expression whose
denominator is the same, namely NE(A), since we have the lower bound $NE(A) \geq \zeta$, and since
$G^{(3)}$ is proper, then in order to make sure that our resulting SCFG $G^{(3)}_{\delta}$ is also proper, it suffices
to only approximate "sufficiently well" the numerators of each rule probability p(r') for rules r'
associated with A, and then normalize all these values by their sum in order to get proper rule
probabilities.

For each rule r' of $G^{(3)}$ with positive probability p(r') > 0, we wish to compute a rational value
$v_{r'} \in (0, 1]$ such that $|p(r') - v_{r'}| < \delta$.

Note that $m := 3|G^{(2)}|$ is an easy upper bound on the number of distinct rules in $G^{(3)}$. How
well do we have to approximate the numerators of probabilities, p(r'), for rules r' associated with
a nonterminal A in $G^{(3)}$, in order to be sure that after normalizing by their sum, these numerator
values yield probabilities $v_{r'}$ such that the inequality $|p(r') - v_{r'}| < \delta$ holds for each one? The
following claim addresses this:

Claim B.15. Suppose $a_1, \ldots, a_r \in (0, 1]$, where $r \leq m$, suppose $b \in [\zeta, 1]$, $0 < \zeta < 1$, and suppose
that $\sum_{i=1}^{r} a_i = b$. For any $\delta > 0$, such that $\delta < 1/(4m)$, let $\delta' := \delta(\zeta/2)^2/(4m)$. Suppose we find
$a'_1, \ldots, a'_r \in (0, 1]$ such that $|a_i - a'_i| < \delta'$ for all i = 1, ..., r. Let $b' = \sum_{i=1}^{r} a'_i$. Then for all
i = 1, ..., r:

$$\left| \frac{a_i}{b} - \frac{a'_i}{b'} \right| < \delta$$

Proof. First note that $|b - b'| = |\sum_{i=1}^{r} (a_i - a'_i)| < r * \delta' \leq m * \delta'$. Then note that $0 < b' < b + m\delta' \leq
1 + \delta(\zeta/2)^2/4 < 2$. Note also that $b' > b - m\delta' > \zeta/2$, the last inequality following because $b \geq \zeta$
and $m\delta' < \zeta/2$. Now we have:

$$\left| \frac{a_i}{b} - \frac{a'_i}{b'} \right| = \frac{|b' a_i - b a'_i|}{b b'} \leq \frac{2 m \delta' \max(b, b')}{b b'} = \frac{2 m \delta'}{\min(b, b')} < \frac{4 m \delta'}{\zeta} = \frac{\delta \zeta}{4} < \delta$$

The first inequality uses $|b' a_i - b a'_i| \leq b' |a_i - a'_i| + a'_i |b' - b|$ together with $a'_i \leq b'$.

□

It follows from Claim B.15 that in order to approximate every rule probability within distance
$\delta > 0$, where we assume $\delta < 1/(4m)$ where $m = 3|G^{(2)}|$, it suffices to approximate the numerators of
the probabilities p(r') for every rule r' associated with A in $G^{(3)}$ within distance $\delta' = \delta(\zeta/2)^2/(4m)$,
where $\zeta = 1/2^{4|G^{(2)}|}$ is the lower bound we know for NE(A), and then to normalize these
approximated numerators by their sum in order to obtain the respective probabilities.

To complete the proof, we now consider separately all the possible forms the rule probability
p(r') could take, and show how to approximate each of their numerators within $\delta'$.

1. Suppose p(r') = p(r)/NE(A) > 0, where p(r) is the given rational rule probability of the
corresponding rule r of $G^{(2)}$.

In this case, we already have the exact rational numerator probability, p(r).

2. Suppose p(r') = p(r) * NE(B)/NE(A) > 0, where p(r) is the given rational rule probability
of the corresponding rule r of $G^{(2)}$.

Since p(r) is a probability, to approximate p(r) * NE(B) within additive error $\delta'$, it suffices to
approximate NE(B) to within additive error $\delta'$. We already know how to do this in P-time,
because NE(B) is the non-termination probability of a SCFG that we can derive in P-time
from $G^{(2)}$.
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3. Suppose p(r') = p(r) * E{B X ) * NE{B 2 )/NE(A) > 0. 

To compute a 5'- approximation a r > of the numerator, such that \a r i — p{r)*E{B\)*N E(B 2 )\ < 
5', we let 5" = S'/A, and we compute approximations v^, 1 and v^ 2 E , of E{B{) and NE(B 2 ), 
respectively, such that \E(Bi) — v I E L \ < 5" and \NE(B 2 ) — v^ 2 E \ < 5", and we let a r > := 
p(r) * w^ 1 * v^ 2 E . Then: 

< 25" * max(?;f 1 , v% ,E(B 1 ),NE{B 2 )) 

< 5' 

4. The only remaining case is when p(r') = p(r) * NE(B\) * N E{B 2 ) / N E{A) > 0. Its proof 
argument is identical to the previous case. We can just replace E(B{) *NE(B 2 ) by NE(B\)* 
NE(B 2 ) and v^ 1 * v^ 2 E by v^ E * v^ 2 E in that argument. 

We have thus established that there is a polynomial time algorithm that, given $G$ and $\delta > 0$,
computes an SCFG $G^{(3)}_{\delta} \in B_{\delta}(G^{(3)})$. □

Our next goal is to prove the following Theorem: 

Theorem B.16. There is a polynomial-time algorithm that, given any proper SCFG $G$ with rational
rule probabilities, and given any rational value $\delta' > 0$ in standard binary representation, computes
a proper SCFG, $G^{(4)}_{\delta'} \in B_{\delta'}(G^{(4)})$, with rational rule probabilities.

Proof. Recall that to obtain $G^{(4)}$ from $G^{(3)}$ we have to eliminate from $G^{(3)}$ the "linear" rules of the
form $A \xrightarrow{p} B$, where $A$ and $B$ are nonterminals. Lemma B.13 showed how this can be done. The
proof of Lemma B.13 considered the finite-state Markov chain, $M$, whose states consist of the set
of nonterminals of $G^{(3)}$ as well as the set of all distinct "right-hand sides", $\gamma$, that appear in any
rule $A \xrightarrow{p} \gamma$ of $G^{(3)}$, and where $\gamma$ is not a single nonterminal, and where the probabilistic transitions
of $M$ basically correspond to the rules of $G^{(3)}$, plus extra absorbing self-loop transitions $(\gamma, 1, \gamma)$,
for every state $\gamma$ that is not a single nonterminal.

Note that $M$ may have irrational probabilities. The proof of Lemma B.13 showed that the probabilities
$q^*_{A_i,\gamma}$ of eventually reaching the state $\gamma$ from the state $A_i$ in $M$ can be used as the probabilities of
new rules $A_i \xrightarrow{q^*_{A_i,\gamma}} \gamma$, after eliminating all linear rules, to obtain the SCFG $G^{(4)}$, such that for all
strings $w$, and all nonterminals $A$, we will have $p^{A}_{G^{(3)},w} = p^{A}_{G^{(4)},w}$.

Regardless of whether $M$ has transitions with irrational probabilities or not, the proof of Lemma
B.13 showed that the probabilities $q^*_{A_i,\gamma}$ which we need can be obtained as follows: we can first
identify those probabilities $q^*_{A_i,\gamma}$ that are greater than 0 by using only the underlying "graph"
structure of the grammar rules with positive probability, without the need to access their actual
probability. Note that we determine which rules of $G^{(3)}$ have positive probability by simply looking
at which rules of $G^{(3)}_{\delta} \in B_{\delta}(G^{(3)})$ have positive probability, for whatever $\delta > 0$ we have chosen,
because the definition of the approximate set $B_{\delta}(G^{(3)})$ requires that positive probability rules retain
positive probability in the approximate SCFGs.

Once we have computed those cases where $q^*_{A_i,\gamma} = 0$, in what remains the probabilities $q^*_{A_i,\gamma} > 0$
can be obtained as the unique solution $x^* = (I - P)^{-1} b^{\gamma}$ of a corresponding linear system of
equations given in (17), which has the form:

$$(I - P)x = b^{\gamma}$$

where $P$ is a substochastic $n \times n$ matrix. Note that by basic facts about transient states in Markov
chains and substochastic matrices $P$, we have that $(I - P)^{-1} = \sum_{k=0}^{\infty} P^k \ge 0$.
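
The identity $(I - P)^{-1} = \sum_{k \ge 0} P^k$ can be illustrated on a tiny example. The sketch below is only an illustration in Python with exact rationals: the 2-state matrix `P`, the vector `b`, and the helper names are assumptions made here, not objects from the paper. It checks a candidate solution by its residual and compares it with partial sums of the series.

```python
from fractions import Fraction


def mat_vec(P, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def neumann_partial_sum(P, b, terms):
    """Return sum_{k=0}^{terms-1} P^k b, a lower approximation of (I - P)^{-1} b."""
    total = [Fraction(0)] * len(b)
    term = list(b)
    for _ in range(terms):
        total = [t + s for t, s in zip(total, term)]
        term = mat_vec(P, term)
    return total


def residual(P, b, x):
    """Return (I - P) x - b, which is the zero vector iff x solves the system."""
    Px = mat_vec(P, x)
    return [xi - pxi - bi for xi, pxi, bi in zip(x, Px, b)]


# Hypothetical 2-state example: P is substochastic, and b holds the one-step
# probabilities of moving to the absorbing target state.
P = [[Fraction(0), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(0)]]
b = [Fraction(1, 4), Fraction(1, 2)]
x_star = [Fraction(4, 7), Fraction(9, 14)]   # candidate solution, verified by residual
approx = neumann_partial_sum(P, b, 60)
```

Because all entries are non-negative, the partial sums increase monotonically towards the solution, which is why `approx` never exceeds `x_star`.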

Note that $P$ and $b^{\gamma}$ may have irrational entries, because they have been derived from rule
probabilities in $G^{(3)}$. We may hope that by approximating sufficiently well the entries of $P$ and $b^{\gamma}$,
which are all rule probabilities of $G^{(3)}$, we obtain an approximated linear system
of equations that still has a unique solution, and that this solution is close to the unique solution
of the original linear equation system.

Unfortunately, we cannot do this in a very naive way, because some of the rule probabilities of
$G^{(3)}$ (and thus transition probabilities of $M$) have in their numerator expressions containing $E(B) =
E_{G^{(2)}}(B)$ for some nonterminal $B$ of $G^{(2)}$. Since $E(B)$ amounts to the termination probability of
an SCFG (with rational rule probabilities) whose encoding size is $O(|G^{(2)}|)$, it can unfortunately
be the case that some positive probabilities in entries of the matrix $P$ are extremely small (double
exponentially small, and possibly irrational) values, namely as small as $1/2^{2^{c|G^{(2)}|}}$, for a fixed
constant $c > 0$.

These very small entries mean that we cannot immediately rule out that the system of equations
$(I - P)x = b^{\gamma}$ is potentially very ill-conditioned.

To overcome this, we observe crucially that the Markov chain $M$ has a very special structure
which allows us to transform it into a different Markov chain $M'$, by basically removing some small
but positive probability transitions, yielding a new chain $M'$ with no transition probabilities that
are "very small", and yet such that each state of $M'$ has a probability of reaching state $\gamma$ that is
very close to that of reaching $\gamma$ in the original Markov chain $M$.

The special structure of $M$ arises for the following reason: the only kinds of rules $r'$ out of
a nonterminal $A$ of $G^{(3)}$ that have a probability $p(r')$ which contains a probability $E(B)$ in its
expression are rules of the form $r'(2)$ or $r'(3)$. But then there must also exist a rule of the form
$r'(1)$ associated with $A$. Since the expression $p(r'(1))$ does not contain a probability $E(B)$ (only
probabilities of the form $NE(B)$), we can lower bound $p(r'(1))$ sufficiently far away from zero.
Specifically, since

$$p(r'(1)) = \frac{p(r) \cdot NE(B) \cdot NE(C)}{NE(A)}$$

where $p(r)$ is a rational rule probability in $G^{(2)}$ (which only has rational rule probabilities), and
$NE(A)$, $NE(B)$, and $NE(C)$ are all probabilities of not generating $\epsilon$ starting at different nonterminals
in $G^{(2)}$, and since we know that these probabilities can be rephrased as non-termination
probabilities in an SCFG with at most the same size as $|G^{(2)}|$, it follows from Theorem 3.12 that
$p(r'(1)) \ge 1/2^{9|G^{(2)}|}$. Moreover, by definition the rule $r'(1)$ has the form $A \xrightarrow{p(r'(1))} \gamma$, where $\gamma$ is
not a single nonterminal, and thus the corresponding transition in $M$ must be a transition to an
absorbing state $\gamma$ of the Markov chain.

We claim that this then means that whenever $p(r'(2))$ or $p(r'(3))$ are sufficiently small probabilities,
relative to $p(r'(1))$, then we can simply remove their corresponding transitions in $M$, yielding a
new Markov chain $M'$, and this will not substantially change, starting at any state, the probability
of eventually reaching any particular absorbing state $\gamma'$.

More precisely, let $\zeta = 1/2^{9|G^{(2)}|}$. Let $\delta' > 0$ be some desired error threshold. Consider any state
$A$ of $M$ such that there are rules of the form $r'(2)$ and $r'(3)$ associated with $A$ in $G^{(3)}$, and thus
corresponding transitions with probability $p(r'(2))$ and $p(r'(3))$ in $M$. Consider the absorbing state
$\gamma$ of $M$, to which there is a transition from the state $A$ with probability $p(r'(1)) \ge \zeta$. Consider
any states $\xi_1, \xi_2$ of $M$ (which may be absorbing or not). Suppose that $p(r'(2)) \le \zeta \cdot \delta'/2$ and that
$p(r'(3)) \le \zeta \cdot \delta'/2$. Let us define the Markov chain $M'$ by removing all such transitions, and let us ask
how much the probability of eventually reaching $\xi_2$ starting from $\xi_1$ in $M$ can change if we simply
remove both of these transitions $r'(2)$ and $r'(3)$ out of state $A$ from $M$.

Since $\gamma$ is an absorbing state, we have that, even if we assume that there is a transition in $M$
from $A$ right back to itself with all of the residual probability $(1 - p(r'(1)) - p(r'(2)) - p(r'(3)))$, then
the total probability that, starting from $A$, we will ever use either of the transitions $r'(2)$ or $r'(3)$ in
$M$ is at most $(p(r'(2)) + p(r'(3)))/p(r'(1)) \le \zeta \cdot \delta'/\zeta = \delta'$. Note that the case where all the residual
probability feeds back to $A$ yields the highest possible probability of ever using either transition
$r'(2)$ or $r'(3)$. Thus, by removing transitions $r'(2)$ and $r'(3)$, for any state $\xi_2$, we would have
changed the probability of eventually reaching $\xi_2$ starting at $A$ by at most $\delta'$. Likewise, starting at
any state $\xi_1$, the probability of eventually reaching $\xi_2$ in $M$ is changed by at most
$\delta'$ by removing transitions $r'(2)$ and $r'(3)$ out of $A$, because any path using these transitions must
first go through state $A$, and the probability that it will eventually go through $r'(2)$ or $r'(3)$ is at
most $\delta'$.
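
The effect of removing the two low-probability transitions can be seen on a small chain. The following Python sketch is only an illustration under stated assumptions: the two transient states (called `A` and `C` here), all transition probabilities, and the helper `solve_2x2` are hypothetical. State `A` moves to the absorbing state $\gamma$ with probability $p_1$, uses $r'(2)$ to reach `C` with probability $p_2$, uses $r'(3)$ to reach the target $\xi_2$ directly with probability $p_3$, and loops back to itself with the residual probability.

```python
from fractions import Fraction as F


def solve_2x2(P, b):
    """Solve (I - P) x = b exactly for a 2x2 substochastic matrix P (Cramer's rule)."""
    a, b01 = 1 - P[0][0], -P[0][1]
    c, d = -P[1][0], 1 - P[1][1]
    det = a * d - b01 * c
    return [(b[0] * d - b01 * b[1]) / det, (a * b[1] - c * b[0]) / det]


p1, p2, p3 = F(1, 4), F(1, 100), F(1, 100)   # hypothetical p(r'(1)), p(r'(2)), p(r'(3))
loop = 1 - p1 - p2 - p3                      # residual self-loop at A

# Transient states are indexed 0 = A, 1 = C; b holds one-step probabilities into xi_2.
P = [[loop, p2], [F(1, 2), F(0)]]
b = [p3, F(1, 2)]
reach_before = solve_2x2(P, b)

# M': the transitions r'(2) and r'(3) out of A are removed.
P_removed = [[loop, F(0)], [F(1, 2), F(0)]]
b_removed = [F(0), F(1, 2)]
reach_after = solve_2x2(P_removed, b_removed)

change = max(abs(u - v) for u, v in zip(reach_before, reach_after))
bound = (p2 + p3) / p1                       # the bound derived in the text
```

In this instance the largest change occurs at state `A` itself, and it stays below `bound`, in line with the argument above.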

For any desired $\delta' > 0$, let us compute in P-time an approximate $G^{(3)}_{\delta} \in B_{\delta}(G^{(3)})$, where
$\delta = \zeta \cdot \delta'/4$. We can then detect the positive "low probability" transitions of the form $r'(2)$ and
$r'(3)$, whose probability is $\le \zeta \cdot \delta'/2$, and we can remove them, yielding a new Markov chain $M'$,
without changing the resulting probability of reaching any absorbing state $\gamma'$ by more than $\delta'$.

Let $q'_{A_i,\gamma}$ denote the probability of reaching absorbing state $\gamma$ starting at state $A_i$ in the Markov
chain $M'$. We know that $|q'_{A_i,\gamma} - q^*_{A_i,\gamma}| \le \delta'$. Our aim is thus to approximate the probabilities $q'_{A_i,\gamma}$
of eventually reaching the absorbing state $\gamma$ starting at any nonterminal state $A_i$ in $M'$, to within
a desired error $\delta'' > 0$.

Let $A_1, \ldots, A_{n'}$ denote the nonterminal states of $M'$ such that $q'_{A_i,\gamma} > 0$. (Note that we can
detect such states in P-time by computing $G^{(3)}_{\delta}$, because these are determined by the underlying
graph based on rules with probability $\ge 2\delta$ in $G^{(3)}$, and these rules can be determined by
computing $G^{(3)}_{\delta}$.)

Now the substochastic matrix $P'$ associated with $M'$ and $\gamma$ is defined to be an $n' \times n'$ matrix,
where $P'_{i,j}$ is the one-step transition probability from state $A_i$ to state $A_j$ in the Markov chain $M'$.

This yields a new system $(I - P')x = b'^{\gamma}$. Note that, as defined, $P'$ may still have
irrational entries, because we have not approximated the other positive rule probabilities which
were not removed. Note that the states $A_1, \ldots, A_{n'}$ of $M'$ are transient, and thus again by standard
facts (e.g., [5], Lemma 8.3.20) we have that the equation $(I - P')x = b'^{\gamma}$ has a unique solution
$x'^* = (I - P')^{-1} b'^{\gamma}$.

Furthermore, we have just argued that $\|x'^* - x^*\|_{\infty} \le \delta'$. It also always holds that the entries of
$b'^{\gamma}$ are either zero or $\ge 1/2^{9|G^{(2)}|}$, and that there is at least one non-zero entry in $b'^{\gamma}$.

Note that $(I - P')^{-1} = \sum_{k=0}^{\infty} (P')^k$. We will need an upper bound on the row sums of $(I - P')^{-1}$.
Note that $(P')^k_{i,j}$ is the probability of being in state $A_j$ in $k$ steps after starting in state $A_i$.

Claim B.17. Let $c > 0$ denote the smallest positive entry in $P'$, and let $p > 0$ denote the smallest
positive entry of $b'^{\gamma}$. Then for all $i \in \{1, \ldots, n'\}$, $0 < \sum_{j=1}^{n'} (I - P')^{-1}_{i,j} \le \frac{n'}{p c^{n'}}$.

Proof. Every state among $A_1, \ldots, A_{n'}$ has, by definition, a positive probability of reaching $\gamma$. Thus,
since there are $n'$ states in total, and each positive probability transition has at least probability
$c > 0$, then the probability that, starting at any of these states $A_i$, we reach $\gamma$ within $n'$ steps is at
least $p c^{n'}$. Thus the probability of not reaching $\gamma$ within $n'$ steps is at most $(1 - p c^{n'})$. But since this is the
case for any such state $A_i$, the probability of not reaching $\gamma$ within $d n'$ steps starting at any state
$A_i$ is at most $(1 - p c^{n'})^d$.

Now note that $(P')^{d n'}_{i,j}$ is the probability of being in state $A_j$ after $d n'$ steps, starting at $A_i$. But by what we
have just argued, we know that $\sum_{j=1}^{n'} (P')^{d n'}_{i,j} \le (1 - p c^{n'})^d$. Thus, for all $i$,

$$\sum_{d=0}^{\infty} \sum_{j=1}^{n'} (P')^{d n'}_{i,j} \le \sum_{d=0}^{\infty} (1 - p c^{n'})^d = \frac{1}{p c^{n'}}.$$

Similarly, for any $r \in \{1, \ldots, n' - 1\}$, the probability of not reaching $\gamma$ within $d n' + r$ steps
starting at any state $A_i$ is also at most $(1 - p c^{n'})^d$. Thus $\sum_{j=1}^{n'} (P')^{d n' + r}_{i,j} \le (1 - p c^{n'})^d$. Thus for all
$i \in \{1, \ldots, n'\}$, and all $r \in \{0, \ldots, n' - 1\}$, we have $\sum_{d=0}^{\infty} \sum_{j=1}^{n'} (P')^{d n' + r}_{i,j} \le \sum_{d=0}^{\infty} (1 - p c^{n'})^d = \frac{1}{p c^{n'}}$.
But note that $\sum_{j=1}^{n'} (I - P')^{-1}_{i,j} = \sum_{j=1}^{n'} (\sum_{k=0}^{\infty} (P')^k)_{i,j} = \sum_{r=0}^{n'-1} \sum_{d=0}^{\infty} \sum_{j=1}^{n'} (P')^{d n' + r}_{i,j} \le n' \cdot \frac{1}{p c^{n'}} = \frac{n'}{p c^{n'}}$. □
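
As a sanity check of the row-sum bound $n'/(p c^{n'})$, the sketch below computes the exact row sums of $(I - P')^{-1}$ for a 2-state example. It is an illustration only: the matrix, the vector, and the function names are assumptions, and the closed-form inverse is specific to the $2 \times 2$ case.

```python
from fractions import Fraction as F


def inverse_of_I_minus_P_2x2(P):
    """Return (I - P)^{-1} for a 2x2 matrix P, using the closed-form 2x2 inverse."""
    a, b = 1 - P[0][0], -P[0][1]
    c, d = -P[1][0], 1 - P[1][1]
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def row_sum_bound(P, b_vec):
    """Return n / (p * c**n), with c and p the smallest positive entries of P and b_vec."""
    n = len(P)
    c = min(v for row in P for v in row if v > 0)
    p = min(v for v in b_vec if v > 0)
    return F(n) / (p * c ** n)


# Hypothetical chain in which both transient states can reach the absorbing state.
P_prime = [[F(0), F(1, 2)], [F(1, 4), F(0)]]
b_gamma = [F(1, 4), F(1, 2)]

row_sums = [sum(row) for row in inverse_of_I_minus_P_2x2(P_prime)]
bound = row_sum_bound(P_prime, b_gamma)
```

The bound is far from tight on this example, which is expected: it only needs to have polynomial encoding size for the later condition-number estimate.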

We now show that approximating the entries of $P'$ and $b'^{\gamma}$ to within a sufficiently small desired
accuracy $\delta'' > 0$ yields a new approximate linear system of equations whose unique solution $\tilde{x}^*$ is
within a desired distance of $x'^*$, and thus within a desired distance of $x^*$. For this, we use a standard
condition number bound for errors in the solution of linear systems of equations:

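Theorem B.18. (see, e.g., [21], Chap 2.1.2, Thm 3$^{1}$) Consider a system of linear equations,
$Bx = b$, where $B \in \mathbb{R}^{n \times n}$ and $b \in \mathbb{R}^{n}$. Suppose $B$ is non-singular, and $b \neq 0$. Let $x^* = B^{-1} b$ be
the unique solution to this linear system, and suppose $x^* \neq 0$. Let $\| \cdot \|$ denote any vector norm
and associated matrix norm (when applied to vectors and matrices, respectively). Let $\mathrm{cond}(B) =
\|B\| \cdot \|B^{-1}\|$ denote the condition number of $B$. Let $\epsilon, \epsilon' > 0$ be values such that $\epsilon' < 1$, and
$\epsilon \cdot \mathrm{cond}(B) \le \epsilon'/4$. Let $\mathcal{E} \in \mathbb{R}^{n \times n}$ and $\theta \in \mathbb{R}^{n}$ be such that $\|\mathcal{E}\|/\|B\| \le \epsilon$,
$\|\theta\|/\|b\| \le \epsilon$, and $\|\mathcal{E}\| < 1/\|B^{-1}\|$.
Then the system of linear equations $(B + \mathcal{E})x = b + \theta$ has a unique solution $\tilde{x}^*$ such that:

$$\frac{\|\tilde{x}^* - x^*\|}{\|x^*\|} \le \epsilon'$$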



We will apply this theorem using the $l_{\infty}$ vector norm and induced matrix norm (maximum
absolute row sum): $\|x\|_{\infty} := \max_i |x_i|$ and $\|A\|_{\infty} := \max_i \sum_j |a_{i,j}|$.



Let us define the matrices and vector in the statement of Theorem B.18 as follows: $B := (I - P')$
and $b := b'^{\gamma}$. Note that $B = (I - P')$ is non-singular and $B^{-1} = \sum_{k=0}^{\infty} (P')^k$, and $x'^* = B^{-1} b$ is the
unique solution to the linear system $Bx = b$, and that $x'^* \neq 0$.

Let us now give bounds for $\|B\|$, $\|B^{-1}\|$ and $\mathrm{cond}(B) := \|B\| \, \|B^{-1}\|$. (Note that we define

Claim B.19. Let $p$ be the smallest non-zero probability labeling any transition in $M'$. Then $p \le
\|B\| \le 2$ and $\|B^{-1}\| \le \frac{n'}{p^{n'+1}}$. Thus $\mathrm{cond}(B) \le \frac{2n'}{p^{n'+1}}$.

Proof. $B = (I - P')$ and $P'$ is a substochastic matrix. Thus $\|(I - P')\| \le 2$. Furthermore, since
every transient state $A_j$ indexing rows and columns of $P'$ has, by definition, non-zero probability
of reaching the absorbing state $\gamma$, we know that the probability of returning from any state $A_j$
immediately back to itself is at most $(1 - p)$, and thus for any $j$, $(I - P')_{j,j} \ge p$, and thus $\|B\| \ge p$.
Next, $\|B^{-1}\| = \|\sum_{k=0}^{\infty} (P')^k\|$, and we established in Claim B.17 that $\|\sum_{k=0}^{\infty} (P')^k\| \le \frac{n'}{p^{n'+1}}$. □

We define $\epsilon, \epsilon' > 0$ as follows: choose an arbitrary desired $\epsilon' = \delta'$ so that $0 < \epsilon' < 1$. Then let

$$\epsilon = \frac{\epsilon'}{4 \cdot \mathrm{cond}(B)}.$$

Given the bounds on $\|B\|$, $\|B^{-1}\|$, and $\mathrm{cond}(B)$ in Claim B.19, we are able to choose a suitable
$\delta$ with polynomial encoding size, namely $\delta := \zeta \cdot \epsilon/(4 \cdot n' \cdot \mathrm{cond}(B) \cdot \|B^{-1}\|)$, and compute in
P-time an approximation $G^{(3)}_{\delta} \in B_{\delta}(G^{(3)})$. Using $G^{(3)}_{\delta}$ we can compute an approximation $\tilde{P}'$ for the
matrix $P'$ (since the entries of $P'$ are rule probabilities in $G^{(3)}$, except for those probabilities that
are too low, $< 2\delta$, which we can remove because we can detect them), and we can also compute
an approximation $\tilde{b}$ for the vector $b = b'^{\gamma}$, such that, letting $\tilde{B} := (I - \tilde{P}')$, if we let $\mathcal{E} := \tilde{B} - B$,
and we let $\theta := \tilde{b} - b$, then $\|\mathcal{E}\|/\|B\| \le \epsilon$ and $\|\theta\|/\|b\| \le \epsilon$, and $\|\mathcal{E}\| < 1/\|B^{-1}\|$.

1 Our statement is weaker, but is directly derivable from the cited Theorem.

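Thus all the conditions are in place to apply Theorem B.18, and we have that the system $\tilde{B}x = \tilde{b}$ has a
unique solution $\tilde{x}^* = \tilde{B}^{-1} \tilde{b}$, and that $\frac{\|\tilde{x}^* - x'^*\|}{\|x'^*\|} \le \epsilon'$. Since we know that $0 < \|x'^*\| \le 1$, we have
that $\|\tilde{x}^* - x'^*\| \le \epsilon' = \delta'$.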
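
The way Theorem B.18 is applied can be illustrated numerically. The sketch below is a minimal Python check on a hypothetical $2 \times 2$ instance: all matrices, vectors, the value of $\epsilon'$, and the function names are assumptions chosen for illustration. It verifies the three hypotheses on the perturbation in the $l_{\infty}$ norm and then computes the actual relative error of the perturbed solution.

```python
from fractions import Fraction as F


def inf_norm_vec(v):
    return max(abs(t) for t in v)


def inf_norm_mat(A):
    """Induced l_infinity norm: maximum absolute row sum."""
    return max(sum(abs(t) for t in row) for row in A)


def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(A, v):
    return [sum(a * t for a, t in zip(row, v)) for row in A]


def perturbation_report(B, b, B_tilde, b_tilde, eps_prime):
    """Return (hypotheses_hold, relative_error) for the perturbed system."""
    B_inv = inverse_2x2(B)
    cond = inf_norm_mat(B) * inf_norm_mat(B_inv)
    eps = eps_prime / (4 * cond)              # so that eps * cond(B) = eps'/4
    E = [[t - o for t, o in zip(rt, ro)] for rt, ro in zip(B_tilde, B)]
    theta = [t - o for t, o in zip(b_tilde, b)]
    hypotheses_hold = (
        inf_norm_mat(E) <= eps * inf_norm_mat(B)
        and inf_norm_vec(theta) <= eps * inf_norm_vec(b)
        and inf_norm_mat(E) * inf_norm_mat(B_inv) < 1
    )
    x = mat_vec(B_inv, b)
    x_tilde = mat_vec(inverse_2x2(B_tilde), b_tilde)
    diff = [s - t for s, t in zip(x_tilde, x)]
    return hypotheses_hold, inf_norm_vec(diff) / inf_norm_vec(x)


# Hypothetical instance: B = I - P' with P' = [[0, 1/2], [1/4, 0]].
B = [[F(1), F(-1, 2)], [F(-1, 4), F(1)]]
b = [F(1, 4), F(1, 2)]
B_tilde = [[F(1), F(-51, 100)], [F(-24, 100), F(1)]]
b_tilde = [F(1, 4) + F(1, 250), F(1, 2) - F(1, 250)]
eps_prime = F(1, 10)
ok, rel_err = perturbation_report(B, b, B_tilde, b_tilde, eps_prime)
```

On this instance the hypotheses hold and the observed relative error is well below $\epsilon'$, as the theorem guarantees.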

We have thus established that we can approximate within a desired additive error $\delta'$ the probabilities
$q^*_{A_i,\gamma}$ of eventually reaching $\gamma$ from any nonterminal $A_i$ via linear rules. In order to construct
$G^{(4)}_{\delta'} \in B_{\delta'}(G^{(4)})$ using this, by adding suitable rules approximating $A_i \to \gamma$, we have to make
a few relatively easy technical observations. Firstly, in our computations we have eliminated rules
whose probability was "too low" to affect the overall probability of reaching $\gamma$ significantly. However,
our definition of $G^{(4)}_{\delta'}$ requires that every rule that has positive probability in $G^{(4)}$ should have
positive probability in $G^{(4)}_{\delta'}$. This is easy to rectify: by the choice of $\delta$ in our approximation $G^{(3)}_{\delta}$,
we can, for all rules $A_i \to \gamma$ that should have positive probability, put in such rules with a
small enough positive probability $\delta/2$, so that it does not affect the overall probability of reaching
any $\gamma$ substantially. This finally leads us to another point: we must make sure $G^{(4)}_{\delta'}$ is a proper
SCFG. This we can do again, because, by choosing $\delta$ to be suitably small, we can make sure that
we can normalize the sum of weights on the approximated rules coming out of each nonterminal
without changing any particular rule probability substantially. This completes the proof that we
can compute $G^{(4)}_{\delta'} \in B_{\delta'}(G^{(4)})$ in P-time. □

Finally, we are ready to finish the proof of both Theorems B.2 and B.4, both of which are direct
corollaries of the following Lemma:

Lemma B.20. Given any SCFG, $G$, with rational rule probabilities, given any rational value $\delta > 0$
in standard binary representation, and given any natural number $N$ specified in unary representation,
there is a polynomial time algorithm that computes an SCFG, $G^{(4)}_{\delta'} \in B_{\delta'}(G^{(4)})$, for
a suitably chosen $\delta' = f(|G|, N, \delta) > 0$, where $f$ is some polynomial function, such that for all
nonterminals $A$, and for all strings $w \in \Sigma^*$, such that $|w| \le N$, it holds that

Moreover, given $G$, $\delta > 0$, and a string $w$ of length at most $N$, we can compute in polynomial
time (in the standard Turing model of computation) a value $v_{G,w}$ such that

$$|p_{G,w} - v_{G,w}| \le \delta$$

Proof. To prove this lemma, we will exploit the standard dynamic programming algorithm (a
variant of Cocke-Kasami-Younger) for computing the "inside" probability, $p_{G,w}$, for an SCFG $G$
that is already in Chomsky Normal Form.

The algorithm was originally observed as part of the inside-outside algorithm by Baker [3] (see
also [24]). It works in P-time in the unit-cost arithmetic RAM model of computation, meaning it
uses a polynomial number of arithmetic $\{+, *\}$ operations, and inductively computes the probabilities,
$q^{A}_{i,j}$, that starting at the nonterminal $A$ the SCFG generates the string of length $j$ starting in
position $i$ of the string $w$. In other words, $q^{A}_{i,j} := P_G(A \Rightarrow^{*} w_i \ldots w_{i+j-1})$.

The induction in the dynamic program is on the length, $j$, of the string. The base case of the
induction is easy: $q^{A}_{i,1}$ is the probability $p$ of the rule $A \xrightarrow{p} w_i$. If no such rule exists, then $q^{A}_{i,1} := 0$.
Note that we can assume the CNF grammar does not have any $\epsilon$ rules.
For the inductive step, assume we have already computed $q^{A}_{i,j'}$, for all nonterminals $A$, all $i$, and
all $j'$ such that $1 \le j' < j$. Let the rules associated with $A$ whose RHSs do not consist of just a
terminal symbol be $A \xrightarrow{p_1} X_1 Y_1$, $A \xrightarrow{p_2} X_2 Y_2$, $\ldots$, $A \xrightarrow{p_k} X_k Y_k$, where $X_d$ and $Y_d$ are nonterminals
for all $d = 1, \ldots, k$. (It may of course be the case that $k = 0$, i.e., that there are no such rules
associated with $A$ in this CNF grammar.)

Then it is easy to check that the probability $q^{A}_{i,j}$ can be computed inductively by the following
arithmetic expression:

$$q^{A}_{i,j} = \sum_{d=1}^{k} p_d \sum_{m=1}^{j-1} q^{X_d}_{i,m} \, q^{Y_d}_{i+m,j-m} \qquad (18)$$

Thus by induction, we can compute $q^{S}_{1,n}$, which is precisely the probability that the grammar $G$
generates the string $w$ starting with the start nonterminal $S$. In this way this algorithm computes
$p_{G,w}$ for SCFGs $G$ that are already in CNF.
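
The dynamic program just described can be sketched directly. The following Python code is a minimal illustration with exact rational arithmetic, corresponding to the unit-cost setting before any rounding. The data layout (`terminal_rules`, `binary_rules`), the function name, and the toy grammar are assumptions made here; positions are 0-based in the code, whereas the text uses 1-based positions.

```python
from fractions import Fraction


def inside_probabilities(terminal_rules, binary_rules, w):
    """Compute inside probabilities for a CNF SCFG without epsilon-rules.

    terminal_rules: dict mapping (A, a) to the probability of rule A -> a.
    binary_rules:   dict mapping A to a list of (p, X, Y) for rules A -> X Y.
    Returns q with q[(A, i, j)] = probability that A derives w[i:i+j]
    (0-based start i, length j >= 1).
    """
    n = len(w)
    nonterminals = {A for (A, _) in terminal_rules} | set(binary_rules)
    for rules in binary_rules.values():
        for _, X, Y in rules:
            nonterminals.update((X, Y))
    q = {}
    for A in nonterminals:                      # base case: substrings of length 1
        for i in range(n):
            q[(A, i, 1)] = terminal_rules.get((A, w[i]), Fraction(0))
    for j in range(2, n + 1):                   # induction on substring length j
        for i in range(n - j + 1):
            for A in nonterminals:
                total = Fraction(0)
                for p, X, Y in binary_rules.get(A, []):
                    for m in range(1, j):       # split point, as in equation (18)
                        total += p * q[(X, i, m)] * q[(Y, i + m, j - m)]
                q[(A, i, j)] = total
    return q


# Hypothetical grammar: S -> S S with probability 1/2, S -> a with probability 1/2.
terminal_rules = {("S", "a"): Fraction(1, 2)}
binary_rules = {"S": [(Fraction(1, 2), "S", "S")]}
q = inside_probabilities(terminal_rules, binary_rules, "aaa")
```

For this toy grammar the string `aaa` has two parse trees, each using the binary rule twice and the terminal rule three times, so `q[("S", 0, 3)]` equals $2 \cdot (1/2)^5 = 1/16$.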

It is important to point out two issues with the above algorithm. 

1. Firstly, even if we assume we are given as input an SCFG $G$ which is already in CNF, and
where all of the rules have rational probabilities, the above inside algorithm, as described, only
works in P-time in the unit-cost arithmetic RAM model of computation. Although it
only requires a polynomial number of arithmetic operations to compute $q^{S}_{1,n}$, it requires
iterated multiplications, so in principle the rational values $q^{A}_{i,j}$
that we compute can blow up in encoding size, and in particular can require encoding size that
is exponential in $j$, the length of the string being parsed.

Thus, to carry out the inside algorithm in P-time in the standard Turing model of computation,
we need to show how we can approximate the output $q^{A}_{i,j}$ in P-time in the Turing model.
We shall show that indeed rounding the intermediate computed values to within a suitable
polynomial number of bits of precision suffices to achieve this, and thus that the approximate
total probability parsing problem when the SCFGs are already in CNF can be carried
out in P-time.

2. A second problem is that our original grammar $G^{(4)}$ has irrational rule probabilities, so we had
to approximate $G^{(4)}$ with a suitable $G^{(4)}_{\delta'}$. In this case the CKY dynamic programming
method is being applied to rule probabilities that only approximate the "true" values of the
rule probabilities. We show that with a good enough approximation this cannot severely
affect the overall probability that is computed.
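
The first issue can be made concrete with a small experiment. The sketch below is an illustration under assumptions, not the paper's procedure: it specializes the inside recursion to a hypothetical one-nonterminal grammar ($S \to S\,S$ with probability `p_bin`, $S \to a$ with probability `p_term`), and compares exact rational arithmetic with a variant that rounds every value down to a fixed number of bits.

```python
from fractions import Fraction


def round_down(x, bits):
    """Round a non-negative rational down to a multiple of 2**-bits."""
    scale = 1 << bits
    return Fraction((x.numerator * scale) // x.denominator, scale)


def inside_single_nonterminal(p_bin, p_term, N, bits=None):
    """Return q with q[j] = probability that S derives a^j, for 1 <= j <= N.

    If bits is given, the rule probabilities and every computed value are
    rounded down to that many bits after each step of the induction.
    """
    if bits is not None:
        p_bin, p_term = round_down(p_bin, bits), round_down(p_term, bits)
    q = {1: p_term}
    for j in range(2, N + 1):
        value = p_bin * sum(q[m] * q[j - m] for m in range(1, j))
        q[j] = value if bits is None else round_down(value, bits)
    return q


exact = inside_single_nonterminal(Fraction(1, 3), Fraction(2, 3), 12)
rounded = inside_single_nonterminal(Fraction(1, 3), Fraction(2, 3), 12, bits=20)
```

In the exact run the denominators keep growing with $j$, while in the rounded run every value fits in 20 fractional bits and the results still agree to within a small additive error.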

We will show that both of these issues can be addressed in the same way: by approximating
the original rule probabilities to within sufficient accuracy requiring only polynomial computation,
then using these inductively in the CKY dynamic programming algorithm, rounding to
within sufficient accuracy after each step of the induction, and iterating the induction up to the
string length value $N$ given in unary, we will be able to compute in P-time (in the standard Turing
model of computation) an output value that is within desired accuracy of the "true" output value
of this algorithm in the unit-cost RAM model on the original (irrational) SCFG $G^{(4)}$, and thus we
will have approximated the desired probabilities $p_{G,w}$ to within sufficient accuracy.

12 Also, we can of course compute/approximate the probability that a CNF grammar generates the empty string, $\epsilon$.
The only possible $\epsilon$ rule is $S \xrightarrow{p} \epsilon$, and this can only appear if $S$ is not on the RHS of any rule. Thus the probability
of generating $\epsilon$ from $S$ is $p$ (or 0 if no such rule exists), and it is 0 for all other nonterminals.

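The key observation for why the above approximations are possible is the following. Suppose
that we are inductively attempting to approximate the value of $q^{A}_{i,j}$, having been given
$\delta$-approximations of all quantities on the RHS of the inductive equation:

$$q^{A}_{i,j} = \sum_{d=1}^{k} p_d \sum_{m=1}^{j-1} q^{X_d}_{i,m} \, q^{Y_d}_{i+m,j-m}$$

First, observe that if we have $\delta$-approximated the probabilities $q^{X_d}_{i,m}$ and $q^{Y_d}_{i+m,j-m}$ with values
$v^{X_d}_{i,m} \in [0, 1]$ and $v^{Y_d}_{i+m,j-m} \in [0, 1]$, respectively, then since all of these values are in $[0, 1]$ we have
that

$$|q^{X_d}_{i,m} \, q^{Y_d}_{i+m,j-m} - v^{X_d}_{i,m} \, v^{Y_d}_{i+m,j-m}| \le 2\delta$$

Therefore,

$$\left| \sum_{m=1}^{j-1} q^{X_d}_{i,m} \, q^{Y_d}_{i+m,j-m} - \sum_{m=1}^{j-1} v^{X_d}_{i,m} \, v^{Y_d}_{i+m,j-m} \right| \le 2j\delta$$

Finally, if we have $\delta$-approximated all probabilities $p_d$ with $v_d \in [0, 1]$, in such a way that
$\sum_{d=1}^{k} v_d \le 1$, then since we know that $\sum_{d=1}^{k} p_d \le 1$, we have that:

$$\left| p_d \sum_{m=1}^{j-1} q^{X_d}_{i,m} \, q^{Y_d}_{i+m,j-m} - v_d \sum_{m=1}^{j-1} v^{X_d}_{i,m} \, v^{Y_d}_{i+m,j-m} \right| \le 2j\delta + j\delta \le 4j\delta$$

Thus

$$\left| \sum_{d=1}^{k} p_d \sum_{m=1}^{j-1} q^{X_d}_{i,m} \, q^{Y_d}_{i+m,j-m} - \sum_{d=1}^{k} v_d \sum_{m=1}^{j-1} v^{X_d}_{i,m} \, v^{Y_d}_{i+m,j-m} \right| \le 4kj\delta$$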

Thus, if the error accumulated at the previous iteration was $\delta$, then the error accumulated at
the next iteration is $4kj\delta$. Inductively, since there are $N$ iterations in total, we see that if $m$ denotes
the number of rules plus the number of distinct RHSs in the grammar then the total error
after all $N$ iterations, assuming the base case has been computed to within error $\delta$, is at most

$$(4m^2)^N \delta$$

Note that this error amounts to a loss of only polynomially many "bits of precision" in the
input size. Note also that we can, after each iteration, round the values to within the desired
polynomially many bits of precision, only accumulating negligible extra error, and still maintain
the overall loss of only polynomially many "bits of precision", even when all computations are on
numbers of polynomial bit length. □
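
The remark about bits of precision can be turned into a small calculation. Assuming the error bound $(4m^2)^N \delta$ stated above, the sketch below computes how many bits of precision the base case needs so that the final error is at most $2^{-t}$. The function names and the sample parameters are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def ceil_log2(x):
    """Smallest integer t with 2**t >= x, for an integer x >= 1."""
    return (x - 1).bit_length()


def base_precision_bits(m, N, target_bits):
    """Return b such that (4*m*m)**N * 2**-b <= 2**-target_bits.

    m, N and target_bits are hypothetical inputs standing for the grammar
    size parameter, the string length bound, and the desired final accuracy.
    """
    return target_bits + N * ceil_log2(4 * m * m)


bits_needed = base_precision_bits(m=10, N=20, target_bits=30)
final_error = Fraction(4 * 10 * 10) ** 20 * Fraction(1, 2 ** bits_needed)
```

The required number of bits grows like $N \cdot \log_2(4m^2)$ plus the target accuracy, which is polynomial in the input size when $N$ is given in unary.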